Abstract 

A major task of evolutionary biology is the reconstruction of phylogenetic trees from molecular data. 
The evolutionary model is given by a Markov chain on a tree. Given samples from the leaves of the 
Markov chain, the goal is to reconstruct the leaf-labelled tree. 

It is well known that in order to reconstruct a tree on n leaves, sample sequences of length $\Omega(\log n)$ 
are needed. It was conjectured by M. Steel that for the CFN/Ising evolutionary model, if the mutation 
probability on all edges of the tree is less than $p^* = (\sqrt{2} - 1)/2^{3/2}$, then the tree can be recovered from 
sequences of length $O(\log n)$. The value $p^*$ is given by the transition point for the extremality of the 
free Gibbs measure for the Ising model on the binary tree. Steel's conjecture was proven by the second 
author in the special case where the tree is "balanced." The second author also proved that if all edges 
have mutation probability larger than $p^*$ then the length needed is $n^{\Omega(1)}$. 

Here we show that Steel's conjecture holds true for general trees by giving a reconstruction algorithm 
that recovers the tree from $O(\log n)$-length sequences when the mutation probabilities are discretized 
and less than $p^*$. Our proof and results demonstrate that extremality of the free Gibbs measure on 
the infinite binary tree, which has been studied before in probability, statistical physics and computer 
science, determines how distinguishable Gibbs measures on finite binary trees are. 

Keywords: Phylogenetics, CFN model, Ising model, phase transitions, reconstruction problem, 
Jukes-Cantor. 

1 Introduction 

In this paper we prove a central conjecture in mathematical phylogenetics [Ste01]: We show that every 
phylogenetic tree with short, discretized edges on n leaves can be reconstructed from sequences of length 
$O(\log n)$, where by short we mean that the mutation probability on every edge is bounded above by the 
critical transition probability for the extremality of the Ising model on the infinite binary tree. 
This result establishes that the extremality of the free Gibbs measure for the Ising model on the 
infinite binary tree, studied in probability, statistical physics and computer science, determines the sampling 
complexity of the phylogenetic problem, a central problem in evolutionary biology. We proceed with 
background on the phylogenetic problem, on the reconstruction problem and a statement of our results. 



Phylogenetic Background. Phylogenies are used in evolutionary biology to model the stochastic evolution 
of genetic data on the ancestral tree relating a group of species. The leaves of the tree correspond to (known) 
extant species. Internal nodes represent extinct species. In particular the root of the tree represents the most 
recent ancestor of all species in the tree. Following paths from the root to the leaves, each bifurcation 
indicates a speciation event whereby two new species are created from a parent. We refer the reader to [SS03] for 
an excellent introduction to phylogenetics. The underlying assumption is that genetic information evolves 
from the root to the leaves according to a Markov model on the tree. It is further assumed that this process 
is repeated independently a number of times denoted k. Thus each node of the tree is associated with a 
sequence of length k. The vector of the i'th letter of all sequences at the leaves is called the i'th character. 
One of the major tasks in molecular biology, the reconstruction of phylogenetic trees, is to infer the topology 
of the tree from the characters at the leaves. 

In this paper we will be mostly interested in two evolutionary models, the so-called 
Cavender-Farris-Neyman (CFN) [Cav78, Far73, Ney71] and Jukes-Cantor (JC) [JC69] models. In the CFN model the states 
at the nodes of the tree are 0 and 1 and their a priori probability at the root is uniform. To each edge e 
corresponds a mutation probability p(e) which is the probability that the state changes along the edge e. 
Note that this model is identical to the free Gibbs measure of the Ising model on the tree. See [Lyo89]. In 
the JC model the states are A, C, G and T with a priori probability 1/4 each. To each edge e corresponds a 
mutation probability p(e) and it is assumed that every state transitions with probability p(e) to each of the 
other states. This model is equivalent to the ferromagnetic Potts model on the tree. 

Extremality and the Reconstruction Problem. A problem that is closely related to the phylogenetic 
problem is that of inferring the ancestral state, that is, the state at the root of the tree, given the states at the 
leaves. This problem was studied earlier in statistical physics, probability and computer science under the 
name of reconstruction problem, or extremality of the free Gibbs measure. See [Spi75, Hig77, Geo88]. The 
reconstruction problem for the CFN model was analyzed in [BRZ95, EKPS00, Iof96, BKMP05, MSW04b, 
BCMR06]. In particular, the role of the reconstruction problem in the analysis of the mixing time of Glauber 
dynamics on trees was established in [BKMP05, MSW04b]. 

Roughly speaking, the reconstruction problem is solvable when the correlation between the root and 
the leaves persists no matter how large the tree is. When it is unsolvable, the correlation decays to 0 for 
large trees. The results of [BRZ95, EKPS00, Iof96, BKMP05, MSW04b, BCMR06] show that for the CFN 
model, if for all e it holds that $p(e) \le p_{\max} < p^*$, then the reconstruction problem is solvable, where 

\[ p^* = \frac{\sqrt{2} - 1}{\sqrt{8}} \approx 15\%. \]

If, on the other hand, for all e it holds that $p(e) \ge p_{\min} > p^*$ and the tree is balanced in the sense that all 
leaves are at the same distance from the root, then the reconstruction problem is unsolvable. Moreover in 
this case, the correlation between the root state and any function of the character states at the leaves decays 
as $n^{-\Omega(1)}$. 

Our Results. M. Steel [Ste01] conjectured that when $0 < p_{\min} \le p(e) \le p_{\max} < p^*$ for all edges e, 
one can reconstruct with high probability the phylogenetic tree from $O(\log n)$ characters. Steel's insightful 
conjecture suggests that there are deep connections between the reconstruction problem and phylogenetic 
reconstruction. 

This conjecture has been proven to hold for trees where all the leaves are at the same graph distance from 
the root (the so-called "balanced" case) in [Mos04]. It is also shown there that the number of characters 
needed when $p(e) \ge p_{\min} > p^*$, for all e, is $n^{\Omega(1)}$. The second result intuitively follows from the fact that 
the topology of the part of the tree that is close to the root is essentially independent of the characters at the 
leaves if the number of characters is not at least $n^{\Omega(1)}$. 

The basic intuition behind Steel's conjecture is the following: since in the regime where $p(e) \le p_{\max} < p^*$ there 
is no decay of the quality of reconstructed sequences, it should be as easy to reconstruct deep trees as it 
is to reconstruct shallow trees. In [ESSW99] (see also [Mos07]) it is shown that "shallow" trees can be 
reconstructed from $O(\log n)$ characters if all mutation probabilities are bounded away from 0 and 1/2 (the 
results of [ESSW99] also show that in this regime, any tree can be recovered from sequences of polynomial 
length). The same high-level reasoning has also yielded a complete proof that $O(\log n)$ characters suffice 
for a percolation-type mutation model when all edges are short. See [MS04] for details. 

Here we prove Steel's conjecture for general trees under the assumption that the mutation probabilities 
are discretized. We show that, if $0 < p_{\min} \le p(e) \le p_{\max} < p^*$ for all edges e of the tree, then the tree 
can be reconstructed from $c(p_{\min}, p_{\max})(\log n + \log 1/\delta)$ characters with error probability at most $\delta$. The 
discretization assumption amounts to assuming that all edge lengths are a multiple of a small constant $\Delta$. 
(See below for a formal definition of "edge length.") This result further implies that sequences of logarithmic 
length suffice to reconstruct phylogenetic trees in the Jukes-Cantor model, when all the edges are sufficiently 
short. 

Compared to [Mos04], our main technical contribution is the design and analysis of a tree-building pro- 
cedure that uses only "local information" to build parts of the tree, while also maintaining "disjointness" 
between the different reconstructed subtrees. The disjointness property is crucial in order to maintain the 
conditional independence properties of the Gibbs distribution on the original tree, for the purpose of 
performing estimation of the states at internal nodes of the tree. Note that in the balanced case of [Mos04] this 
property can be achieved in a straightforward manner by building the tree "level-by-level." 

1.1 Formal Definitions and Main Results 

Trees and Path Metrics. Let T be a tree. Write $\mathcal{V}(T)$ for the nodes of T, $\mathcal{E}(T)$ for the edges of T and 
$\mathcal{L}(T)$ for the leaves of T. If the tree is rooted, then we denote by $\rho(T)$ the root of T. Unless stated otherwise, 
all trees are assumed to be binary (all internal degrees are 3) and it is further assumed that $\mathcal{L}(T)$ is labelled. 

Let T be a tree equipped with a length function on its edges, $d : \mathcal{E}(T) \to \mathbb{R}_+$. The function d will also 
denote the induced path metric on $\mathcal{V}(T)$: $d(v,w) = \sum\{d(e) : e \in \mathrm{path}_T(v,w)\}$, for all $v, w \in \mathcal{V}(T)$, 
where $\mathrm{path}_T(x,y)$ is the path (sequence of edges) connecting x to y in T. 

We will further assume below that the length of all edges is bounded between f and g, that is, 
$f \le d(e) \le g$ for all $e \in \mathcal{E}(T)$. 

Markov Model of Evolution. The evolutionary process is determined by a rooted tree T = (V, E) 
equipped with a path metric d and a rate matrix Q. We will be mostly interested in the case where 
$Q = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$, corresponding to the CFN model, and in the case where Q is a $4 \times 4$ matrix given by 
$Q_{ij} = 1 - 4 \cdot 1\{i = j\}$, corresponding to the Jukes-Cantor model. To edge e of length d(e) we associate 
the transition matrix $M^e = \exp(d(e) Q)$. 

In the evolutionary model on the tree T rooted at $\rho$ each vertex iteratively chooses its state from the state 
at its parent by an application of the Markov transition rule $M^e$, where e is the edge connecting it to its 
parent. We assume that all edges in E are directed away from the root. Thus the probability distribution on 
the tree is the probability distribution on $\{0,1\}^V$ ($\{A,C,G,T\}^V$) given by 

\[ \bar\mu[\sigma] = \pi(\sigma(\rho)) \prod_{(x \to y) \in E} M^{(x \to y)}_{\sigma(x), \sigma(y)}, \]

where $\pi$ is given by the uniform distribution at the root, so that $\pi(0) = \pi(1) = 1/2$ for the CFN model and 
$\pi(A) = \pi(C) = \pi(G) = \pi(T) = 1/4$ for the JC model. We let the measure $\mu$ denote the marginal of $\bar\mu$ on 
the set of leaves which we identify with $[n] = \{1, \ldots, n\}$. Thus 

\[ \mu[\sigma] = \sum \{ \bar\mu[\tau] : \tau(i) = \sigma(i) \text{ for all } i \in [n] \}. \]

The measure $\mu$ defines the probability distribution at the leaves of the tree. 
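
As a rough illustration of the generative process just described, the following minimal Python sketch samples one character under the CFN model. The function name `simulate_cfn`, the dictionary-based tree encoding, and the choice to pass mutation probabilities p(e) directly (rather than edge lengths d(e)) are assumptions made for this example only.

```python
import random


def simulate_cfn(children, root, p, rng):
    """Sample one character from the CFN model on a rooted tree.

    children: dict mapping a node to the list of its children
              (edges are directed away from the root).
    p:        dict mapping an edge (parent, child) to its mutation probability.
    rng:      a random.Random instance.
    Returns a dict mapping every node to a state in {0, 1}.
    """
    state = {root: rng.randrange(2)}  # uniform a priori state at the root
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            # The state flips along edge (u, v) with probability p(u, v).
            flip = rng.random() < p[(u, v)]
            state[v] = state[u] ^ int(flip)
            stack.append(v)
    return state
```

Restricting the returned dictionary to the leaves gives one sample from the leaf marginal; calling the function k times independently gives k characters.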

We note that both for the CFN model and for the JC model, the transition matrices $M^e$ admit a simple 
alternative representation. For the CFN model, with probability $p(e) = (1 - \exp(-2d(e)))/2$, there is 
a transition and, otherwise, there is no transition. Similarly for the JC model, with probability $p(e) = 
(1 - \exp(-4d(e)))/4$ each of the three possible transitions occurs. In particular, defining 

\[ g^* := \frac{\ln 2}{4}, \tag{1} \]

we may formulate the result on the reconstruction problem for the phase transition of the CFN model as 
follows: "If $d(e) \le g < g^*$ for all e then the reconstruction problem is solvable." 
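
The correspondence between edge lengths and mutation probabilities can be checked numerically. The sketch below assumes the threshold $g^* = (\ln 2)/4$, which is the value at which $2e^{-4g} = 1$ (compare the Notation subsection), and $p^* = (\sqrt{2}-1)/2^{3/2}$ from the abstract; the function and constant names are hypothetical.

```python
import math


def cfn_mutation_probability(d):
    """CFN model: probability of a state change along an edge of length d."""
    return (1.0 - math.exp(-2.0 * d)) / 2.0


def jc_mutation_probability(d):
    """JC model: probability of a transition to each of the three other states."""
    return (1.0 - math.exp(-4.0 * d)) / 4.0


# Assumed critical edge length and critical mutation probability for CFN.
G_STAR = math.log(2.0) / 4.0
P_STAR = (math.sqrt(2.0) - 1.0) / 2.0 ** 1.5

# An edge of length g* has mutation probability p* (about 0.146).
assert abs(cfn_mutation_probability(G_STAR) - P_STAR) < 1e-12
```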



Phylogenetic Reconstruction Problem. We will be interested in reconstructing phylogenies in the regime 
where the reconstruction problem is solvable. The objective is to reconstruct the underlying tree whose 
internal nodes are unknown from the collection of sequences at the leaves. Since for both the CFN model 
and the JC model, the distribution $\mu[\sigma]$ described above is independent of the location of the root we can 
only aim to reconstruct the underlying unrooted topology. Let $\mathcal{T}$ represent the set of all binary topologies 
(that is, unrooted undirected binary trees) and $\mathcal{M}^{\mathrm{CFN}}_{f,g}$ the family of CFN transition matrices, as described 
above, which correspond to distances d satisfying: 

\[ 0 < f \le d \le g < g^*, \]

where $g^*$ is given by (1) and f is an arbitrary positive constant. Let $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$ denote the set of all 
unrooted phylogenies, where the underlying topology is in $\mathcal{T}$ and all transition matrices on the edges are 
in $\mathcal{M}^{\mathrm{CFN}}_{f,g}$. Rooting $T \in \mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$ at an arbitrary node, let $\mu_T$ be the measure at the leaves of T as 
described above. It is well known, e.g. [ESSW99, Cha96], that different elements in $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$ correspond 
to different measures; therefore we will identify measures with their corresponding elements of $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$. 
We are interested in finding an (efficiently computable) map $\Psi$ such that $\Psi(\sigma^1_\partial, \ldots, \sigma^k_\partial) \in \mathcal{T}$, where $\sigma_\partial = 
(\sigma^i_\partial)_{i=1}^k$ are k characters at the leaves of the tree. Moreover, we require that for every distribution $\mu_T \in 
\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$, if the characters $\sigma^1_\partial, \ldots, \sigma^k_\partial$ are generated independently from $\mu_T$, then with high probability 
$\Psi(\sigma^1_\partial, \ldots, \sigma^k_\partial) = T$. The problem of finding an efficiently computable map $\Psi$ is called the phylogenetic 
reconstruction problem for the CFN model. The phylogenetic reconstruction problem for the JC model is 
defined similarly. In [ESSW99], it is shown that there exists a polynomial-time algorithm that reconstructs 
the topology from $k = \mathrm{poly}(n, \log 1/\delta)$ characters, with probability of error $\delta$. Our results are the following. 
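We first define a subset of $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$. In words, the $\Delta$-Branch Model ($\Delta$-BM) is a subset of $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$ 
where the edge lengths d(e), $e \in E$, are discretized. This extra assumption is made for technical reasons. 
See Subsection 7.3. 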

Definition 1.1 ($\Delta$-Branch Model) Let $\Delta > 0$. We denote by $\mathbb{Y}[\Delta]$ the subset of $\mathcal{T} \otimes \mathcal{M}^{\mathrm{CFN}}_{f,g}$ where all 
d(e)'s are multiples of $\Delta$. We call $\mathbb{Y}[\Delta]$ the $\Delta$-Branch Model ($\Delta$-BM). 



Theorem 1 (Main Result) Consider the $\Delta$-Branch Model above with $0 < f < g < g^*$ and $\Delta > 0$. 
Then there exists a polynomial-time algorithm that reconstructs the topology of the tree from $k = 
c(f, g, \Delta)(\log n + \log 1/\delta)$ characters with error probability at most $\delta$. In particular, 

\[ c(f, g, \Delta) = \frac{C(g)}{\min\{\Delta^2, f^2\}}. \]

Moreover, the value $g^*$ given by (1) is tight. 

Corollary 2 (Jukes-Cantor Model) Consider the JC model on binary trees where all edges satisfy 

\[ 0 < f < g < g^*_{\mathrm{JC}}, \quad \text{where } g^*_{\mathrm{JC}} := g^*/2. \]

Under the $\Delta$-BM, there exists a polynomial-time algorithm that reconstructs the topology of the tree from 
$c'(f, g, \Delta)(\log n + \log 1/\delta)$ characters with error probability at most $\delta$ and 

\[ c'(f, g, \Delta) = \frac{C'(g)}{\min\{\Delta^2, f^2\}}. \]

The value $g^*_{\mathrm{JC}}$ corresponds to the so-called Kesten-Stigum bound [KS66] on the reconstruction threshold, 
which has been conjectured to be tight for the JC model [MM06, Sly08]. 

Theorem 1 and Corollary 2 extend also to cases where the data at the leaves is given with an arbitrary level 
of noise. For this robust phylogenetic reconstruction problem both values $g^*$ and $g^*_{\mathrm{JC}}$ are tight. See [JM04]. 

The results stated here were first reported without proof in [DMR06]. Note that in [DMR06] the results 
were stated without the discretization assumption, which is in fact needed for the final step of the proof. This 
is further explained in Subsection 7.3. 



1.2 Organization of the Paper 

Roughly speaking, our reconstruction algorithm has two main components. First, the statistical component 
consists in 

1. [Ancestral Reconstruction] the reconstruction of sequences at internal nodes; 

2. [Distance Estimation] the estimation of distances between the nodes. 

The former, detailed in Section 2, is borrowed from [Mos04] where Steel's conjecture is proved for the 
special case of balanced trees. In general trees, however, distance estimation is complicated by nontrivial 
correlations between reconstruction biases. We deal with these issues in Section 3. 

Second, the combinatorial component of the algorithm, which uses quartet-based ideas from 
phylogenetics and is significantly more involved than [Mos04], is detailed in Sections 4 and 5. A full description of 
the algorithm as well as an example of its execution can also be found in Section 5. Proof of the correctness 
of the algorithm is provided in Sections 6 and 7. 

1.3 Notation 

Throughout we fix $0 < f < g < g' < g^*$, $\Delta > 0$, and $\gamma > 3$. By definition of $g^*$, we have $2e^{-4g} > 1$ and 
$2e^{-4g'} > 1$. We let $\theta = e^{-2g}$ and $\theta' = e^{-2g'}$. 



2 Reconstruction of Ancestral Sequences 

Ancestral Reconstruction. In this section we state the results of [Mos04] on ancestral reconstruction 
using recursive majority and we briefly explain how these results are used in the reconstruction algorithm. 
Following [Mos04], we use state values $\pm 1$ instead of 0/1. Furthermore, we use the parameter $\theta(e) = 
1 - 2p(e) = e^{-2d(e)}$. Note that $\theta(e)$ measures the correlation between the states at the endpoints of e. 
Because the CFN model is ferromagnetic, we have $0 \le \theta(e) \le 1$. In terms of $\theta$ we have reconstruction 
solvability whenever $\theta(e) \ge \theta > \theta^*$, for all edges e, where the value $\theta^*$ satisfies $2(\theta^*)^2 = 1$. 

For the CFN model both the majority algorithm [Hig77] and the recursive majority algorithm [Mos98]^1 
are effective in reconstructing the root value. Note that for other models, in general, most simple 
reconstruction algorithms are not effective all the way to the reconstruction threshold [Mos01, MP03, JM04]. 
However, as noted in [Mos04], there is an important difference between the two reconstruction algorithms 
when different edges e have different values of $\theta(e)$. Suppose that $\theta(e) \ge \theta' > \theta^*$ for all edges e. Then, the 
recursive majority function is effective in reconstructing the root value with probability bounded away from 
1/2 (as a function of $\theta'$). On the other hand, it is easy to construct examples where the majority function 
reconstructs the root with probability tending to 1/2 as the tree size increases. 

The difference between the two algorithms can be roughly stated as follows. When different edges have 
different $\theta$-values, different "parts" of the tree should have different weights in calculating the reconstructed 
value at the root. Indeed, when all $\theta$-values are known, an appropriate weighted majority function can 
estimate the root value correctly with probability bounded away from 1/2 [MP03]. However, when the 
$\theta$-values are unknown, using uniform weights may result in an arbitrarily inaccurate estimator. 

Recursive majority, on the other hand, does not require knowledge of the $\theta$-values to be applied 
successfully. This essentially follows from the fact that the majority function is "noise-reducing" in the following 
sense. Suppose $\theta' > \theta^*$. Then, as we shall see shortly, there exists an integer $\ell$ and noise level $q < 1/2$ such 
that majority on the $\ell$-level binary tree has the following property: if all leaf values are given with stochastic 
noise at most q, then the majority of these values differs from the actual root state with probability at most q. 
Therefore, recursive application of the majority function, $\ell$ levels at a time, results in an estimator whose 
error is at most q for any number of levels. 

Properties of Recursive Majority. We proceed with a formal definition of recursive majority. Let 
$\widehat{\mathrm{Maj}} : \{-1, 1\}^d \to \{-1, 1\}$ be the function defined as follows 

\[ \widehat{\mathrm{Maj}}(x_1, \ldots, x_d) = \mathrm{sign}\left( \sum_{i=1}^d x_i + 0.5\,\omega \right), \]

where $\omega$ is $\pm 1$ with probability 1/2 independently of the $x_i$'s. In other words, $\widehat{\mathrm{Maj}}$ outputs the majority 
value of its input arguments, unless there is a tie in which case it outputs $\pm 1$ with probability 1/2 each. 
For consistency, we denote all statistical estimators with a hat. Note in particular that our notation differs 
from [Mos04]. 
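
The majority function with random tie-breaking is short enough to write out. The following is a minimal sketch; the name `maj` and the optional `rng` parameter are assumptions of this example.

```python
import random


def maj(xs, rng=random):
    """Majority of a sequence of +1/-1 values.

    The term 0.5 * omega, with omega a fair +1/-1 coin, only matters when
    the inputs sum to zero, in which case it breaks the tie uniformly.
    """
    omega = rng.choice((-1, 1))
    total = sum(xs) + 0.5 * omega
    return 1 if total > 0 else -1
```

Because the inputs sum to an integer, adding $\pm 0.5$ never produces zero, so the sign is always well defined.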



1 See below for a definition of these estimators. 

Next we define the "noise-reduction" property of majority. The function $\eta$ below is meant to measure 
the noise level at the leaves. 



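Definition 2.1 (Correlation at the root) Let T = (V, E) be a tree rooted at $\rho$ with leaf set $\partial T$. For 
functions $\tilde\theta : E \to [0, 1]$ and $\tilde\eta : \partial T \to [0, 1]$, let $\mathrm{CFN}(\tilde\theta, \tilde\eta)$ be the CFN model on T where $\theta(e) = \tilde\theta(e)$ for all 
e which are not adjacent to $\partial T$, and $\theta(e) = \tilde\theta(e)\tilde\eta(v)$ for all e = (u, v), with $v \in \partial T$. Let 

\[ \mathrm{MajCorr}(\tilde\theta, \tilde\eta) = \mathbb{E}\left[ +\widehat{\mathrm{Maj}}(\sigma_{\partial T}) \,\middle|\, \sigma_\rho = +1 \right] = \mathbb{E}\left[ -\widehat{\mathrm{Maj}}(\sigma_{\partial T}) \,\middle|\, \sigma_\rho = -1 \right], \]

where $\sigma$ is one sample drawn from $\mathrm{CFN}(\tilde\theta, \tilde\eta)$. 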
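Proposition 2.2 (Noise reduction of majority [Mos04]) Let b and $\theta_{\min}$ be such that $b\theta^2_{\min} > h^2 > 1$. 
Then there exist $\ell = \ell(b, \theta_{\min})$, $\alpha = \alpha(b, \theta_{\min}) > h > 1$ and $\beta = \beta(b, \theta_{\min}) > 0$, such that any 
$\mathrm{CFN}(\theta, \eta)$ model on the $\ell$-level b-ary tree satisfying $\min_{e \in E} \theta(e) \ge \theta_{\min}$ and $\min_{v \in \partial T} \eta(v) \ge \eta_{\min} > 0$ 
also satisfies: 

\[ \mathrm{MajCorr}(\theta, \eta) \ge \min\{\alpha\,\eta_{\min}, \beta\}. \tag{2} \]

In particular if $\eta_{\min} \ge \beta$ then: 

\[ \mathrm{MajCorr}(\theta, \eta) \ge \beta. \tag{3} \]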

General Trees. Recursive application of Proposition 2.2 allows the reconstruction of the root value on any 
balanced binary tree with correlation at least $\beta$. However, below we consider general trees. In particular, 
when estimating the sequence at an internal node u of the phylogenetic tree, we wish to apply Proposition 2.2 
to a subtree rooted at u and this subtree need not be balanced. This can be addressed by "completing" the 
subtree into a balanced tree (with number of levels a multiple of $\ell$) and taking all added edges to have length 
d(e) = 0, that is, $\theta(e) = 1$. Fix $\ell$ and $\beta$ so as to satisfy Proposition 2.2 with b = 2 and $\theta_{\min} = \theta$. (Recall 
that for the proof of Theorem 1 we assume $\theta(e) \ge \theta > \theta^*$, for all edges e.) Consider the following recursive 
function of $x = (x_1, x_2, \ldots)$: 

\[ \widehat{\mathrm{Maj}}^0_\ell(x_1) = x_1, \]

\[ \widehat{\mathrm{Maj}}^j_\ell(x_1, \ldots, x_{2^{j\ell}}) = \widehat{\mathrm{Maj}}\left( \widehat{\mathrm{Maj}}^{j-1}_\ell(x_1, \ldots, x_{2^{(j-1)\ell}}), \ldots, \widehat{\mathrm{Maj}}^{j-1}_\ell(x_{2^{j\ell} - 2^{(j-1)\ell} + 1}, \ldots, x_{2^{j\ell}}) \right), \]

for all $j \ge 1$. Now, if $\sigma_{\partial T}$ is a character at the leaves of T, let us define the function $\widehat{\mathrm{Anc}}_{\rho,T}(\sigma_{\partial T})$ that 
estimates the ancestral state at the root $\rho$ of T using recursive majority as follows: 

1. Let $\tilde T$ be the tree T (minimally) completed with edges of $\theta$-value 1 so that $\tilde T$ is a complete binary tree 
with a number of levels a multiple of $\ell$, say $J\ell$; 

2. Assign to the leaves of $\tilde T$ the value of their ancestor leaf in T under $\sigma_{\partial T}$; 

3. Let $\tilde\sigma$ be the leaf states of $\tilde T$ arranged in pre-order; 

4. Compute 

\[ \widehat{\mathrm{Anc}}_{\rho,T}(\sigma_{\partial T}) := \widehat{\mathrm{Maj}}^J_\ell(\tilde\sigma). \]
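
The recursion can be evaluated bottom-up on the leaf states of the completed tree. The sketch below assumes the completion step has already been carried out, so that the input is a list of $\pm 1$ leaf states of a complete binary tree with $J\ell$ levels, in left-to-right order; the names `maj` and `recursive_majority` are hypothetical.

```python
import random


def maj(xs, rng):
    """Majority of +1/-1 values with a fair coin breaking ties."""
    total = sum(xs) + 0.5 * rng.choice((-1, 1))
    return 1 if total > 0 else -1


def recursive_majority(leaves, ell, rng=None):
    """Apply majority ell levels at a time until a single value remains.

    leaves: list of +1/-1 states of length 2**(J * ell) for some J >= 0.
    ell:    number of binary-tree levels collapsed by each majority vote.
    """
    rng = rng or random.Random()
    block = 2 ** ell  # number of leaves below a node ell levels up
    level = list(leaves)
    while len(level) > 1:
        if len(level) % block != 0:
            raise ValueError("number of leaves must be a power of 2**ell")
        # Replace each block of 2**ell values by its majority.
        level = [maj(level[i:i + block], rng)
                 for i in range(0, len(level), block)]
    return level[0]
```

Collapsing blocks from the leaves upward computes the same value as unfolding the recursive definition from the root downward, since level 0 of the recursion is the identity.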



From Proposition 2.2 we get: 
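
Proposition 2.3 (Recursive Majority) Let T = (V, E) be a tree rooted at $\rho$ with leaf set $\partial T$. Let $\sigma_{\partial T}$ be 
one sample drawn from $\mathrm{CFN}(\theta, \eta)$ with $\theta(e) \ge \theta > \theta^*$ for all edges e and $\eta(v) = 1$ for all $v \in \partial T$. Then, 
we can choose $\ell$ and $\beta > 0$ so that 

\[ \mathbb{P}\left[ \widehat{\mathrm{Anc}}_{\rho,T}(\sigma_{\partial T}) = \sigma_\rho \,\middle|\, \sigma_\rho = +1 \right] = \mathbb{P}\left[ \widehat{\mathrm{Anc}}_{\rho,T}(\sigma_{\partial T}) = \sigma_\rho \,\middle|\, \sigma_\rho = -1 \right] \ge \frac{1 + \beta}{2}. \tag{4} \]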



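In the remainder of the paper, we will use Proposition 2.3 with $\theta(e) = e^{-2d(e)}$. Also, if $\sigma_{\partial T} = (\sigma^t_{\partial T})_{t=1}^k$ is 
a collection of k characters at the leaves of T, we extend the function $\widehat{\mathrm{Anc}}_{\rho,T}$ to collections of characters in 
the natural way: 

\[ \widehat{\mathrm{Anc}}_{\rho,T}(\sigma_{\partial T}) := \left( \widehat{\mathrm{Anc}}_{\rho,T}(\sigma^t_{\partial T}) \right)_{t=1}^k. \tag{5} \]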



3 Distance Estimation 

Throughout this section, we fix a tree T on n leaves. Recall the definition of our path metric d from 
Section 1.1. We assume that $\sigma_{\partial T} = (\sigma^t_{\partial T})_{t=1}^k$ are k i.i.d. samples (or characters) at the leaves of T 
generated by the CFN model with parameters $d(e) \le g$ for all $e \in \mathcal{E}(T)$ (a lower bound on d is not required 
in this section, but it will be in the next one). We think of $\sigma_{\partial T} = ((\sigma^t_u)_{u \in \partial T})_{t=1}^k$ as n sequences of length k 
and we sometimes refer to $\sigma_u = (\sigma^t_u)_{t=1}^k$ as the "sequence at u". 

Distance between leaves. As explained in Section 1.1, a basic concept in phylogenetic reconstruction 
is the notion of a metric on the leaves of the tree. The distance between two leaves is a measure of the 
correlation between their respective sequences. We begin with the definition of a natural distance estimator. 



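Definition 3.1 (Correlation between sequences) Let u, v be leaves of T. The following quantity 

\[ \widehat{\mathrm{Dist}}(\sigma_u, \sigma_v) = -\frac{1}{2} \ln\left( \frac{1}{k} \sum_{t=1}^k \sigma^t_u \sigma^t_v \right) \tag{6} \]

is an estimate of d(u, v). In cases where the sum inside the log is non-positive, we define $\widehat{\mathrm{Dist}}(\sigma_u, \sigma_v) = 
+\infty$. 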
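
The estimator in Definition 3.1 translates directly into code. A minimal sketch, assuming sequences are given as equal-length lists of $\pm 1$ values; the name `dist_hat` is hypothetical.

```python
import math


def dist_hat(seq_u, seq_v):
    """Empirical-correlation distance estimate between two +1/-1 sequences.

    Returns -0.5 * ln(average of the coordinate-wise products), or +infinity
    when that average is non-positive.
    """
    k = len(seq_u)
    if k == 0 or k != len(seq_v):
        raise ValueError("sequences must be non-empty and of equal length")
    corr = sum(a * b for a, b in zip(seq_u, seq_v)) / k
    if corr <= 0:
        return math.inf
    return -0.5 * math.log(corr)
```

Identical sequences give distance 0, and the estimate grows as the empirical correlation between the two sequences shrinks.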

The next proposition provides a guarantee on the performance of $\widehat{\mathrm{Dist}}$. The proof follows from standard 
concentration inequalities. See, e.g., [ESSW99]. An important point to note is that, if we use short sequences 
of length $O(\log n)$, the guarantee applies only to short distances, that is, distances of order O(1). 

Proposition 3.2 (Accuracy of $\widehat{\mathrm{Dist}}$) For all $\varepsilon > 0$, $M > 0$, there exists $c = c(\varepsilon, M; \gamma) > 0$, such that if 
the following conditions hold: 

• [Short Distance] $d(u, v) < M$, 

• [Sequence Length] $k = c' \log n$, for some $c' > c$, 

then 

\[ \left| d(u, v) - \widehat{\mathrm{Dist}}(\sigma_u, \sigma_v) \right| < \varepsilon, \]

with probability at least $1 - O(n^{-\gamma})$. 
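
Proof: By Azuma's inequality we get 

\begin{align*}
\mathbb{P}\left[ \widehat{\mathrm{Dist}}(\sigma_u, \sigma_v) > d(u, v) + \varepsilon \right]
&= \mathbb{P}\left[ \frac{1}{k} \sum_{t=1}^k \sigma^t_u \sigma^t_v < e^{-2d(u,v) - 2\varepsilon} \right] \\
&= \mathbb{P}\left[ \frac{1}{k} \sum_{t=1}^k \sigma^t_u \sigma^t_v < \mathbb{E}[\sigma^1_u \sigma^1_v] - (1 - e^{-2\varepsilon}) e^{-2d(u,v)} \right] \\
&\le \mathbb{P}\left[ \frac{1}{k} \sum_{t=1}^k \sigma^t_u \sigma^t_v < \mathbb{E}[\sigma^1_u \sigma^1_v] - (1 - e^{-2\varepsilon}) e^{-2M} \right] \\
&\le \exp\left( - \frac{\left( (1 - e^{-2\varepsilon}) e^{-2M} \right)^2}{2k(2/k)^2} \right) \\
&= n^{-c'K},
\end{align*}

with K depending on M and $\varepsilon$. Above, we used that 

\[ d(u, v) = -\frac{1}{2} \ln \mathbb{E}[\sigma^1_u \sigma^1_v]. \]

A similar inequality holds for the other direction. 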
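
To get a feel for the constants, the one-sided tail bound in the proof can be evaluated numerically. The sketch below assumes the bound $\exp(-t^2 k / 8)$ with $t = (1 - e^{-2\varepsilon})e^{-2M}$, which is the Azuma exponent above simplified using $2k(2/k)^2 = 8/k$; with $k = c' \log n$ this equals $n^{-c'K}$ for $K = t^2/8$. The function names are hypothetical, and the constant for the other direction of the inequality is not computed here.

```python
import math


def azuma_exponent(eps, big_m):
    """K = t**2 / 8 with t = (1 - exp(-2*eps)) * exp(-2*big_m)."""
    t = (1.0 - math.exp(-2.0 * eps)) * math.exp(-2.0 * big_m)
    return t * t / 8.0


def min_sequence_constant(eps, big_m, gamma):
    """Smallest c' with n**(-c' * K) <= n**(-gamma), for this direction only."""
    return gamma / azuma_exponent(eps, big_m)
```

For example, `min_sequence_constant(0.1, 1.0, 3.0)` gives a sufficient multiple of $\log n$ for accuracy 0.1 at distances below 1; the value is large, which reflects that the bound is not optimized.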



Distance Between Internal Nodes. Our reconstruction algorithm actually requires us to compute 
distances between internal nodes. Note that, in general, we do not know the true sequences at internal nodes of 
the tree. Therefore, we need to estimate distances by applying $\widehat{\mathrm{Dist}}$ to reconstructed sequences. An obvious 
issue with this idea is that reconstructed sequences are subject to a systematic bias. 

Definition 3.3 (Bias) Suppose v is the root of a subtree $T_v$ of T. Let $\hat\sigma_v = \widehat{\mathrm{Anc}}_{v,T_v}(\sigma_{\partial T_v})$. Then, the 
quantity 

\[ \lambda(T_v, v) = -\frac{1}{2} \ln\left( \mathbb{E}[\sigma^1_v \cdot \hat\sigma^1_v]_+ \right), \]

is called the reconstruction bias at v on $T_v$. (Note the similarity to (6).) We denote 

\[ B(g) = \sup \lambda(T_v, v), \]

where the supremum is taken over all rooted binary trees with edge lengths at most g. Note that for all g, 
$0 < g < g^*$, we have by Proposition 2.3 

\[ 0 < B(g) < +\infty. \]

We also denote 

\[ \beta(g) = e^{-2B(g)}. \]

Since $B(g) > 0$, we cannot estimate exactly the internal sequences. Hence Proposition 3.2 cannot be 
used directly to estimate distances between internal nodes. We deal with this issue as follows. Consider the 
configuration in Figure 3.1. More precisely: 



Definition 3.4 (Basic Disjoint Setup (Preliminary Version)) Root T at an arbitrary vertex. Note that, by 
reversibility, the CFN model on a tree T can be rerooted arbitrarily without changing the distribution at 
the leaves. Denote by $T_x$ the subtree of T rooted at x. We consider two internal nodes $u_1$, $u_2$ that are 
not descendants of each other in T. Denote by $v_\varphi$, $w_\varphi$ the children of $u_\varphi$ in $T_{u_\varphi}$, $\varphi = 1, 2$. We call this 
configuration the Basic Disjoint Setup (Preliminary Version). 

We need a few basic combinatorial definitions. 

Definition 3.5 (Restricted Subtree) Let $V' \subseteq V$ be a subset of the vertices of T. The subtree of T restricted 
to $V'$ is the tree $T'$ obtained by 1) keeping only nodes and edges on paths between vertices in $V'$ and 2) by 
then contracting all paths composed of vertices of degree 2, except the nodes in $V'$. We sometimes use the 
notation $T|_{V'} := T'$. See Figure 3.2 for an example. 



Definition 3.6 (Quartet) A quartet is a set of four leaves. More generally, for any set $Q = \{v_1, w_1, v_2, w_2\}$ 
of four nodes of T, we think of Q as a quartet on $T|_Q$. We say that a quartet $Q = \{v_1, w_1, v_2, w_2\}$ is 
nondegenerate if none of the nodes in Q are on a path between two other nodes in Q. There are three 
possible (leaf-labeled) topologies on a nondegenerate quartet, called quartet splits, one for each partition 
of the four leaves into two pairs. In Figure 3.1 the correct quartet split on Q is $\{\{v_1, w_1\}, \{v_2, w_2\}\}$ which 
we denote by $v_1 w_1 | v_2 w_2$. 

To compute the distance between $u_1$ and $u_2$, we think of the path between $u_1$ and $u_2$ as the internal edge 
of the quartet $Q = \{v_1, w_1, v_2, w_2\}$. The reconstructed sequences at $x \in Q$ also suffer from a systematic 
bias. However, we prove in Proposition 3.8 that the bias does not affect the computation of the length of 
the internal edge of the quartet. Indeed, as depicted in Figure 3.3, we treat the systematic errors introduced 
by the reconstruction as "extra edges" attached to the nodes in $Q = \{v_1, w_1, v_2, w_2\}$, with corresponding 
"endpoints" $\hat Q = \{\hat v_1, \hat w_1, \hat v_2, \hat w_2\}$. Our estimator $\widehat{\mathrm{Int}}$ is then obtained from a classical distance-based 
computation, known in phylogenetics as the Four-Point Method [Bun71], applied to the "extra nodes" $\hat Q$ 
(see (7) below). For this idea to work, it is crucial that the systematic error in the reconstructed sequences 
at $\hat x \in \hat Q$ be independent given the true sequences at $x \in Q$. This is the case when $u_1$ and $u_2$ are not 
descendants of each other because of the Markov property. 

We now define our distance estimator $\widehat{\mathrm{Int}}$. We let $\hat\sigma_x = \widehat{\mathrm{Anc}}_{x,T_x}(\sigma_{\partial T_x})$ be the reconstructed sequence at x. 
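
Definition 3.7 (Distance Estimator $\widehat{\mathrm{Int}}$) Consider the Basic Disjoint Setup (Preliminary Version). Then, 
we let 

\[ \widehat{\mathrm{Int}}(\hat\sigma_{v_1}, \hat\sigma_{w_1}; \hat\sigma_{v_2}, \hat\sigma_{w_2}) = \frac{1}{2}\left( \widehat{\mathrm{Dist}}(\hat\sigma_{v_1}, \hat\sigma_{v_2}) + \widehat{\mathrm{Dist}}(\hat\sigma_{w_1}, \hat\sigma_{w_2}) - \widehat{\mathrm{Dist}}(\hat\sigma_{v_1}, \hat\sigma_{w_1}) - \widehat{\mathrm{Dist}}(\hat\sigma_{v_2}, \hat\sigma_{w_2}) \right). \tag{7} \]

If one of the $\widehat{\mathrm{Dist}}$ quantities is $+\infty$, the function $\widehat{\mathrm{Int}}$ is set to $+\infty$. 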
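
The four-point computation, and the reason pendant "extra edges" do not disturb it, can be seen in a small numeric sketch. The function `int_hat` below takes the four pairwise distance estimates as plain numbers; its name and argument order are assumptions of this example.

```python
import math


def int_hat(d_v1v2, d_w1w2, d_v1w1, d_v2w2):
    """Four-point estimate of the internal edge length of the quartet
    v1 w1 | v2 w2 from four pairwise distances."""
    values = (d_v1v2, d_w1w2, d_v1w1, d_v2w2)
    if any(math.isinf(x) for x in values):
        return math.inf
    return 0.5 * (d_v1v2 + d_w1w2 - d_v1w1 - d_v2w2)


# Quartet with pendant edge lengths a, b (at u1), c, e (at u2) and internal
# edge length m. Each pendant edge is lengthened by a bias term, as with the
# "extra edges" of reconstructed sequences.
a, b, c, e, m = 0.10, 0.20, 0.15, 0.05, 0.30
bias = {"v1": 0.03, "w1": 0.07, "v2": 0.02, "w2": 0.04}
a, b, c, e = a + bias["v1"], b + bias["w1"], c + bias["v2"], e + bias["w2"]
estimate = int_hat(a + m + c, b + m + e, a + b, c + e)
```

Each pendant length appears once with a plus sign and once with a minus sign, so `estimate` equals m regardless of the bias values.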



The next proposition provides a guarantee on the performance of $\widehat{\mathrm{Int}}$. As in Proposition 3.2, the 
guarantee applies only to O(1) distances, if we use sequences of length $O(\log n)$. 

Proposition 3.8 (Accuracy of $\widehat{\mathrm{Int}}$) Consider the Basic Disjoint Setup (Preliminary Version). Let $Q = 
\{v_1, w_1, v_2, w_2\}$. For all $\varepsilon > 0$, $M > 0$, there exists $c = c(\varepsilon, M; \gamma, g) > 0$ such that, if the following 
hold: 

• [Edge Lengths] For all $e \in \mathcal{E}(T)$, $d(e) \le g < g^*$; 

• [Short Distances] We have $d(x, y) < M$, $\forall x, y \in Q$; 

• [Sequence Length] The sequences used have length $k = c' \log n$, for some $c' > c$; 

then, with probability at least $1 - O(n^{-\gamma})$, 

\[ \left| d(u_1, u_2) - \widehat{\mathrm{Int}}(\hat\sigma_{v_1}, \hat\sigma_{w_1}; \hat\sigma_{v_2}, \hat\sigma_{w_2}) \right| < \varepsilon. \]

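Proof: First note that, for all $t \in \{1, \ldots, k\}$, conditioned on $\{\sigma^t_x\}_{x \in Q}$, the values $\{\hat\sigma^t_x\}_{x \in Q}$ are independent 
by the Markov property. Therefore, as we observed before, the systematic errors introduced by the 
reconstruction process can be treated as "extra edges", with endpoints $\hat Q = \{\hat v_1, \hat w_1, \hat v_2, \hat w_2\}$ (see Figure 3.3). 
Moreover, from (4), it follows that, $\forall x \in Q$, $\forall t \in \{1, \ldots, k\}$, 

\[ \mathbb{P}\left[ \hat\sigma^t_x = \sigma^t_x \,\middle|\, \sigma^t_x \right] \ge \frac{1 + \beta(g)}{2}, \tag{8} \]

where $\beta(g) > 0$ is specified by Definition 3.3. From (8), the lengths of the "extra edges" (equal to the 
reconstruction biases) are at most B(g). 

Appealing to Proposition 3.2 with constants $\varepsilon/2$ and $M + 2B(g)$ and the union bound, we argue that with 
probability at least $1 - O(n^{-\gamma})$ for all $x, y \in Q$ 

\[ \left| \widehat{\mathrm{Dist}}(\hat\sigma_x, \hat\sigma_y) - d(\hat x, \hat y) \right| < \frac{\varepsilon}{2}, \]

where we extend d to the "extra nodes" $\hat x$, $\hat y$ by setting 

\[ d(x, \hat x) = \lambda(T_x, x), \]

and likewise for y (see Figure 3.3). Observing that $d(u_1, u_2) = \frac{1}{2}\left( d(\hat v_1, \hat v_2) + d(\hat w_1, \hat w_2) - d(\hat v_1, \hat w_1) - d(\hat v_2, \hat w_2) \right)$ 
and using the above we get 

\[ \left| d(u_1, u_2) - \widehat{\mathrm{Int}}(\hat\sigma_{v_1}, \hat\sigma_{w_1}; \hat\sigma_{v_2}, \hat\sigma_{w_2}) \right| < \varepsilon. \]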

Algorithm DistanceEstimate 

Input: Two nodes $u_1$, $u_2$; a rooted forest $\mathcal{F}$; accuracy radius $R_{\mathrm{acc}}$; 
Output: Distance estimate $\nu$; 

• [Children] Let $w_\varphi$, $v_\varphi$ be the children of $u_\varphi$ in $\mathcal{F}$ for $\varphi = 1, 2$ (if $u_\varphi$ is a leaf, set $w_\varphi = v_\varphi = u_\varphi$); 

• [Sequence Reconstruction] For $x \in \{v_1, w_1, v_2, w_2\}$, set $\hat\sigma_x = \widehat{\mathrm{Anc}}_{x, T^{\mathcal{F}}_x}(\sigma_{\partial T^{\mathcal{F}}_x})$; 

• [Accuracy Cutoff] If there is $\{x, y\} \subseteq \{v_1, w_1, v_2, w_2\}$ such that 

\[ \widehat{\mathrm{Dist}}(\hat\sigma_x, \hat\sigma_y) > R_{\mathrm{acc}}, \]

return $+\infty$; 

• [Distance Computation] Otherwise return 

\[ \nu = \widehat{\mathrm{Int}}(\hat\sigma_{v_1}, \hat\sigma_{w_1}; \hat\sigma_{v_2}, \hat\sigma_{w_2}). \]

Figure 3.4: Routine DISTANCEESTIMATE. 

Figure 3.5: The subtrees $T|_{\{u_1, u_2, u_3, u_8\}}$ and $T|_{\{u_4, u_5, u_6, u_7\}}$ are edge-disjoint. The subtrees $T|_{\{u_1, u_5, u_6, u_8\}}$ 
and $T|_{\{u_2, u_3, u_4, u_7\}}$ are edge-sharing. 



Distances Between Restricted Subtrees. In fact, we need to apply Proposition 3.8 to restricted subtrees of the true tree. Indeed, in the reconstruction algorithm, we maintain a "partially reconstructed subforest" of the true tree, that is, a collection of restricted subtrees of $T$. For this more general setup, we use the routine DistanceEstimate detailed in Figure 3.4.

To generalize Proposition 3.8 we need a few definitions. First, the notion of edge disjointness is borrowed from [Mos07].

Definition 3.9 (Edge Disjointness) Denote by $\mathrm{path}_T(x, y)$ the path (sequence of edges) connecting $x$ to $y$ in $T$. We say that two restricted subtrees $T_1$, $T_2$ of $T$ are edge disjoint if

$$\mathrm{path}_T(x_1, y_1) \cap \mathrm{path}_T(x_2, y_2) = \emptyset,$$

for all $x_1, y_1 \in \mathcal{L}(T_1)$ and $x_2, y_2 \in \mathcal{L}(T_2)$. We say that $T_1$, $T_2$ are edge sharing if they are not edge disjoint. See Figure 3.5 for an example. (If $T_1$ and $T_2$ are directed, we take this definition to refer to their underlying undirected version.)
Definition 3.10 (Legal Subforest) We say that a tree is a rooted full binary tree if all its internal nodes have degree 3 except for the root which has degree 2. A restricted subtree $T_1$ of $T$ is a legal subtree of $T$ if:

1. It is a rooted full binary tree, that is, it has a unique node of degree 2, its root;

2. And all its leaves are leaves of $T$.

Moreover, we say that a forest

$$\mathcal{F} = \{T_1, \ldots, T_\alpha\}$$

is a legal subforest of $T$ if the $T_i$'s are edge-disjoint legal subtrees of $T$. We denote by $\rho(\mathcal{F})$ the set of roots of $\mathcal{F}$.

Definition 3.11 (Dangling Subtrees) We say that two edge-disjoint legal subtrees $T_1$, $T_2$ of $T$ are dangling if there is a choice of root $\rho^*$ for $T$ not in $T_1$ or $T_2$ that is consistent with the rooting of both $T_1$ and $T_2$, that is, pointing the edges of $T$ away from $\rho^*$ is such that the edges of $T_1$ and $T_2$ are directed away from their respective roots.

See Figure 3.7 below for an example where two legal, edge-disjoint subtrees are not dangling. We generalize the basic configuration of Figure 3.1 as follows:



Definition 3.12 (Basic Disjoint Setup (Dangling)) Let $T_1 = T_{u_1}$ and $T_2 = T_{u_2}$ be two legal subtrees of $T$ rooted at $u_1$ and $u_2$ respectively. Assume further that $T_1$ and $T_2$ are edge-disjoint and dangling. Denote by $v_\varphi$, $w_\varphi$ the children of $u_\varphi$ in $T_\varphi$, $\varphi = 1, 2$. If $u_\varphi$ is a leaf, we let instead $v_\varphi = w_\varphi = u_\varphi$. We call this configuration the Basic Disjoint Setup (Dangling).



We make another important remark about the more general setup considered here. Note that even though the true tree satisfies the "short edge length" condition (that is, $d(e) < g$) by assumption, partially reconstructed subtrees may not, because their edges are actually paths in the true tree. Therefore, an important step of the reconstruction algorithm is to make sure that all restricted edges of the partially reconstructed subforest are short enough for recursive majority to be accurate. (This explains why we need to use two constants $g < g'$ below $g^*$.) Refer to routine LocalCherry in Figure 5.4 in Section 5.

We then obtain the following generalization of Proposition 3.8. Note that the routine DistanceEstimate in Figure 3.4 uses an "accuracy cutoff" $R_{acc}$. As we show in the proof below, this ensures that distances that are too long are rejected, in which case the value $+\infty$ is returned.



Proposition 3.13 (Accuracy of DistanceEstimate) Consider the Basic Disjoint Setup (Dangling). Let $Q = \{v_1, w_1, v_2, w_2\}$. For all $\varepsilon > 0$, $M > 0$, and $R_{acc} > M + 2B(g')$, there exists $c = c(\varepsilon, M, R_{acc}; \gamma, g') > 0$ such that, if the following hold:

• [Edge Length] It holds that $d(e) < g' < g^*$, $\forall e \in \mathcal{E}(T_x)$, $x \in Q$;

• [Sequence Length] The sequences used have length $k = c' \log n$, for some $c' > c$;

then, with probability at least $1 - O(n^{-\gamma})$, the following holds: letting $\nu$ be the output of DistanceEstimate in Figure 3.4, we have that if one of the following holds

1. [Short Distances] We have $d(x, y) < M$, $\forall x, y \in Q$;

2. [Finite Estimate] $\nu < +\infty$;

then

$$\left| d(u_1, u_2) - \nu \right| < \varepsilon.$$



Proof: The first half of the proposition follows immediately from Propositions 3.2 and 3.8 Refer to Figure 



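The second part follows from a "double window" argument as in [ESSW99, Theorem 9]. Let $0 < \tau < 1/2$ be such that

$$R_{acc} = M + 2B(g') - \frac{1}{2}\ln(2\tau).$$

By assumption, $\nu < +\infty$ and therefore for all pairs $\{x, y\} \subseteq \{v_1, w_1, v_2, w_2\}$ we have

$$\mathrm{Dist}(\hat\sigma_x, \hat\sigma_y) \le M + 2B(g') - \frac{1}{2}\ln(2\tau). \qquad (9)$$

It follows directly from Azuma's inequality that a pair $\{x, y\}$ such that

$$d(\hat x, \hat y) > M + 2B(g') - \frac{1}{2}\ln\tau$$

satisfies (9) with probability at most

$$\begin{aligned}
\mathbb{P}\left[\mathrm{Dist}(\hat\sigma_x, \hat\sigma_y) \le M + 2B(g') - \tfrac{1}{2}\ln 2\tau\right]
&= \mathbb{P}\left[\frac{1}{k}\sum_{t=1}^{k} \hat\sigma^t_x \hat\sigma^t_y \ge 2\tau e^{-2[M + 2B(g')]}\right] \\
&\le \mathbb{P}\left[\frac{1}{k}\sum_{t=1}^{k} \left(\hat\sigma^t_x \hat\sigma^t_y - \mathbb{E}\left[\hat\sigma^t_x \hat\sigma^t_y\right]\right) \ge \tau e^{-2[M + 2B(g')]}\right] \\
&\le \exp\left(-\frac{\left(\tau e^{-2[M + 2B(g')]}\right)^2}{2k(2/k)^2}\right) \\
&= n^{-c'K},
\end{aligned}$$

for a constant $K > 0$ depending on $M$, $g'$, $\tau$. The result follows by applying the first part of the proposition with $M$ replaced by $M + 2B(g') - \frac{1}{2}\ln\tau$. (Note that $d(x, y) \le d(\hat x, \hat y)$, for all $x, y \in Q$.) ■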

We also need a simpler variant of DistanceEstimate (Figure 3.4) whose purpose is to test whether the internal path of a quartet is longer than $g$. We record this variant in Figure 3.6.



Detecting Long Distances When $T_1$ And $T_2$ Are Not Dangling. Roughly speaking, our reconstruction algorithm works by progressively merging subtrees that are close in the true tree. (See Section 5 for further details.) Hence, the algorithm needs to tell whether or not two subtrees are sufficiently close to be considered for this merging operation. However, as we explain in the next sections, we cannot guarantee that the Basic Disjoint Setup (Dangling) in Proposition 3.13 applies to all situations encountered during the execution of the algorithm. Instead we use a special "multiple test" to detect long distances. This test is performed by the routine DistortedMetric detailed in Figure 3.8.

The routine has a further important property. During the course of the algorithm, since we only have partial information about the structure of the tree, we may not always know whether or not two subtrees are dangling, and therefore whether or not DistanceEstimate in Figure 3.4 returns accurate distance estimates. The "multiple test" in DistortedMetric is such that, if the routine returns a finite estimate, that estimate is accurate. We proceed with an explanation of these properties.

We further generalize the basic configuration of Figure 3.1 as follows:



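Algorithm IsShort

Input: Two pairs of nodes $(v_1, w_1)$, $(v_2, w_2)$; rooted forest $\mathcal{F}$; accuracy radius $R_{acc}$; tolerance $\varepsilon$;
Output: Boolean value and length estimate;

• [Sequence Reconstruction] For $x \in \{v_1, w_1, v_2, w_2\}$, set $\hat\sigma_x$ to the output of the estimator Anc applied to the sequences at the leaves below $x$ in $\mathcal{F}$;

• [Accuracy Cutoff] If there is $\{x, y\} \subseteq \{v_1, w_1, v_2, w_2\}$ such that

$$\mathrm{Dist}(\hat\sigma_x, \hat\sigma_y) > R_{acc},$$

return $+\infty$;

• [Internal Edge Length] Set

$$\nu = \mathrm{Int}\left(\hat\sigma_{v_1}, \hat\sigma_{w_1}; \hat\sigma_{v_2}, \hat\sigma_{w_2}\right);$$

• [Test] If $\nu < g + \varepsilon$ return (TRUE, $\nu$), o.w. return (FALSE, 0);

Figure 3.6: Routine IsShort.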

Definition 3.14 (Basic Disjoint Setup (General)) Let $T_1 = T_{x_1}$ and $T_2 = T_{x_2}$ be two restricted subtrees of $T$ rooted at $x_1$ and $x_2$ respectively. Assume further that $T_1$ and $T_2$ are edge-disjoint, but not necessarily dangling. Denote by $y_\varphi$, $z_\varphi$ the children of $x_\varphi$ in $T_\varphi$, $\varphi = 1, 2$. Let $w_\varphi$ be the node in $T$ where the path between $T_1$ and $T_2$ meets $T_\varphi$, $\varphi = 1, 2$. Note that $w_\varphi$ may not be in $T_\varphi$ since $T_\varphi$ is restricted, $\varphi = 1, 2$. If $w_\varphi \neq x_\varphi$, assume without loss of generality that $w_\varphi$ is in the subtree of $T$ rooted at $z_\varphi$, or on the edge $(x_\varphi, z_\varphi)$, $\varphi = 1, 2$. We call this configuration the Basic Disjoint Setup (General). See Figure 3.7. Let $d(T_1, T_2)$ be the length of the path between $w_1$ and $w_2$ (in the path metric $d$).

The key point to note is that when computing the distance between $y_1$ and $y_2$ rather than the distance between $x_1$ and $x_2$, the assumptions of Proposition 3.13 are satisfied (as the claim holds in this case by rooting the tree at any node along the path connecting $w_1$ and $w_2$). Hence, if $T_1$ and $T_2$ are far apart, the distance between $y_1$ and $y_2$ is correctly estimated as being large. On the other hand, if $T_1$ and $T_2$ are dangling and close, then distances between all pairs in $\{y_1, z_1\} \times \{y_2, z_2\}$ are accurately estimated as being short.



Proposition 3.15 (Accuracy of DistortedMetric) Consider the Basic Disjoint Setup (General) with $\mathcal{F} = \{T_1, T_2\}$ and $Q = \{y_1, z_1, y_2, z_2\}$. For all $\varepsilon > 0$, $M > 0$, and $R_{acc} > M + 2B(g') + 4g'$ there exists $c = c(\varepsilon, M, R_{acc}; \gamma, g') > 0$ such that, if the following hold:

• [Edge Length] It holds that $d(e) < g' < g^*$, $\forall e \in \mathcal{E}(T_x)$, $x \in Q$; also, $d(x_\varphi, y_\varphi), d(x_\varphi, z_\varphi) < g'$, $\varphi = 1, 2$;

• [Weight Estimates] We are given weight estimates

$$h : \mathcal{E}(\mathcal{F}) \to \mathbb{R}_+,$$

such that $|h(e) - d(e)| < \varepsilon/16$, $\forall e \in \mathcal{E}(\mathcal{F})$;

• [Sequence Length] The sequences used have length $k = c' \log n$, for some $c' > c$;

then, with probability at least $1 - O(n^{-\gamma})$, the following holds: letting $\nu$ be the output of DistortedMetric in Figure 3.8, we have that if one of the following holds

1. [Dangling Case] $T_1$ and $T_2$ are dangling and $d(T_1, T_2) < M$;

2. [Finite Estimate] $\nu < +\infty$;

then

$$\left| \nu - d(x_1, x_2) \right| < \varepsilon.$$

Figure 3.7: Basic Disjoint Setup (General). The rooted subtrees $T_1$, $T_2$ are edge-disjoint but are not assumed to be dangling. The white nodes may not be in the restricted subtrees $T_1$, $T_2$. The case $w_1 = x_1$ and/or $w_2 = x_2$ is possible. Note that if we root the tree at any node along the dashed path, the subtrees rooted at $y_1$ and $y_2$ are edge-disjoint and dangling (unlike $T_1$ and $T_2$).



Proof: Choose $c$ as in Proposition 3.13 with parameters $\varepsilon/16$ and $M + 4g'$. As in Figure 3.8, let

$$\mathcal{D}'(r_1, r_2) := \mathrm{DistanceEstimate}(r_1, r_2; \mathcal{F}; R_{acc}) - h(x_1, r_1) - h(x_2, r_2),$$

for all pairs $(r_1, r_2) \in \{y_1, z_1\} \times \{y_2, z_2\}$ (where DistanceEstimate is defined in Figure 3.4). In Case 1, the result follows directly from Proposition 3.13 and the remarks above Proposition 3.15. In particular, note that, with probability $1 - O(n^{-\gamma})$,

$$\left| \mathcal{D}'(r_1, r_2) - d(x_1, x_2) \right| < \frac{3\varepsilon}{16},$$

for all $(r_1, r_2) \in \{y_1, z_1\} \times \{y_2, z_2\}$ and therefore

$$\max\left\{ \left| \mathcal{D}'(r_1^{(1)}, r_2^{(1)}) - \mathcal{D}'(r_1^{(2)}, r_2^{(2)}) \right| : (r_1^{(\varphi)}, r_2^{(\varphi)}) \in \{y_1, z_1\} \times \{y_2, z_2\}, \varphi = 1, 2 \right\} < 2\left(\frac{3\varepsilon}{16}\right) < \varepsilon/2,$$

and DistortedMetric returns a finite value (namely $\nu := \mathcal{D}'(z_1, z_2)$; see Figure 3.8) that is accurate within $3\varepsilon/16 < \varepsilon$.

In Case 2, the condition $\nu < +\infty$ implies in particular that all four distance estimates are equal up to $\varepsilon/2$. From the remark above the statement of the proposition, at least one such distance, say $\mathcal{D}'(y_1, y_2)$ w.l.o.g., is computed within the Basic Disjoint Setup (Dangling) and we can therefore apply Proposition 3.13 again. In particular, we have

$$\left| \mathcal{D}'(y_1, y_2) - d(x_1, x_2) \right| < \frac{3\varepsilon}{16},$$

and therefore

$$\left| \nu - d(x_1, x_2) \right| < \frac{\varepsilon}{2} + \frac{3\varepsilon}{16} < \varepsilon,$$

(where again $\nu := \mathcal{D}'(z_1, z_2)$). ■

Remark. Note that DistortedMetric outputs $+\infty$ in two very distinct situations. In particular, if DistortedMetric outputs $+\infty$, then either the subtrees are too far apart to obtain an accurate estimate or $T_1$ and $T_2$ are not dangling (or both). This convention will turn out to be convenient in the statement of the full algorithm.



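Algorithm DistortedMetric

Input: Two nodes $x_1$, $x_2$; a rooted forest $\mathcal{F}$; edge lengths $\{h(e)\}_{e \in \mathcal{F}}$; accuracy radius $R_{acc}$; tolerance $\varepsilon$;
Output: Distance $\nu$;

• [Children] Let $y_\varphi$, $z_\varphi$ be the children of $x_\varphi$ in $\mathcal{F}$ for $\varphi = 1, 2$ (if $x_\varphi$ is a leaf, set $z_\varphi = y_\varphi = x_\varphi$);

• [Distance Computations] For all pairs $(r_1, r_2) \in \{y_1, z_1\} \times \{y_2, z_2\}$, compute

$$\mathcal{D}'(r_1, r_2) := \mathrm{DistanceEstimate}(r_1, r_2; \mathcal{F}; R_{acc}) - h(x_1, r_1) - h(x_2, r_2);$$

• [Multiple Test] If

$$\max\left\{ \left| \mathcal{D}'(r_1^{(1)}, r_2^{(1)}) - \mathcal{D}'(r_1^{(2)}, r_2^{(2)}) \right| : (r_1^{(\varphi)}, r_2^{(\varphi)}) \in \{y_1, z_1\} \times \{y_2, z_2\}, \varphi = 1, 2 \right\} < \varepsilon/2,$$

return $\nu := \mathcal{D}'(z_1, z_2)$, otherwise return $\nu := +\infty$ (return $\nu := +\infty$ if any of the distances above is $+\infty$).

Figure 3.8: Routine DistortedMetric.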
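The "multiple test" can be sketched in a few lines. The version below is an illustration only: `distance_estimate` is a hypothetical callable standing in for DistanceEstimate with the forest and $R_{acc}$ already fixed, `h` is a dictionary of edge-length estimates keyed by (parent, child), and the numbers in the usage part are made up. For a leaf $x$, the sketch assumes `children[x] = (x, x)` and `h[(x, x)] = 0`.

```python
import math
from itertools import product

def distorted_metric(distance_estimate, h, x1, x2, children, eps):
    """Sketch of the multiple test: four corrected estimates must agree within eps/2."""
    y1, z1 = children[x1]
    y2, z2 = children[x2]
    corrected = {}
    for r1, r2 in product((y1, z1), (y2, z2)):
        raw = distance_estimate(r1, r2)
        if math.isinf(raw):
            return math.inf  # any infinite estimate makes the output infinite
        # Subtract the edges (x1, r1) and (x2, r2) to get an estimate of d(x1, x2).
        corrected[(r1, r2)] = raw - h[(x1, r1)] - h[(x2, r2)]
    values = list(corrected.values())
    # The largest pairwise discrepancy equals max - min.
    if max(values) - min(values) < eps / 2:
        return corrected[(z1, z2)]
    return math.inf

# Toy usage with made-up numbers: true d(x1, x2) = 1.0.
children = {"x1": ("y1", "z1"), "x2": ("y2", "z2")}
h = {("x1", "y1"): 0.2, ("x1", "z1"): 0.3, ("x2", "y2"): 0.1, ("x2", "z2"): 0.25}

def consistent(r1, r2):
    # Dangling case: every path between children runs through x1 and x2.
    return 1.0 + h[("x1", r1)] + h[("x2", r2)]

def inconsistent(r1, r2):
    # One child sees a distorted distance, as can happen without dangling.
    return consistent(r1, r2) + (0.5 if r1 == "y1" else 0.0)

agree = distorted_metric(consistent, h, "x1", "x2", children, eps=0.1)
disagree = distorted_metric(inconsistent, h, "x1", "x2", children, eps=0.1)
```

In the consistent case the sketch returns the corrected estimate for $(z_1, z_2)$; in the inconsistent case the estimates differ by more than $\varepsilon/2$ and the output is `math.inf`.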



4 Quartet Tests

Before giving a complete description of the reconstruction algorithm, we introduce some important combinatorial tools. As we discussed previously, we make use of a key concept from phylogenetics: the notion of a quartet. In the previous section, we showed how to estimate accurately distances between internal nodes of the tree. In this section, we explain how to use such estimates to perform topological tests on quartets. Those tests form the basic building blocks of the combinatorial component of the algorithm. As before, we fix a tree $T$ on $n$ leaves. Moreover, we assume that our path metric $d$ on $T$ satisfies $d(e) > f$ for all $e \in T$. Note that an upper bound on $d$ is not explicitly required in this section since we assume that appropriate distance estimates are handed to us.



Splits. The routine IsSplit in Figure 4.1 performs a classic test to decide the correct split of a quartet. It is based on the so-called Four-Point Method [Bun71]. Proposition 4.1 guarantees the correctness of IsSplit. Its proof is omitted. We consider once again the Basic Disjoint Setup (Dangling) of Section 3.

Proposition 4.1 (Correctness of IsSplit) Consider the Basic Disjoint Setup (Dangling). Let

$$Q = \{v_1, w_1, v_2, w_2\},$$
Algorithm IsSplit

Input: Two pairs of nodes $(v_1, w_1)$, $(v_2, w_2)$; a distance matrix $\mathcal{D}$ on these four nodes;
Output: TRUE or FALSE;

• [Internal Edge Length] Set

$$\nu = \frac{1}{2}\left( \mathcal{D}(w_1, w_2) + \mathcal{D}(v_1, v_2) - \mathcal{D}(w_1, v_1) - \mathcal{D}(w_2, v_2) \right);$$

(set $\nu = +\infty$ if any of the distances is $+\infty$)

• [Test] If $\nu < f/2$ return FALSE, o.w. return TRUE.

Figure 4.1: Routine IsSplit.



and let $\mathcal{D}$ be the distance matrix on the four nodes of $Q$ passed to IsSplit in Figure 4.1. Assume that $d(e) > f$ for all edges on the subtree restricted to $Q$. If

$$\left| d(x, y) - \mathcal{D}(x, y) \right| < \frac{f}{4}, \quad \forall x, y \in Q,$$

then the call IsSplit$((v_1, w_1), (v_2, w_2); \mathcal{D})$ returns TRUE, whereas the calls IsSplit$((v_1, w_2), (v_2, w_1); \mathcal{D})$ and IsSplit$((v_1, v_2), (w_1, w_2); \mathcal{D})$ return FALSE.
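The three calls in Proposition 4.1 can be illustrated on a small example. This is a sketch only: the distance table is made up (a quartet with pendant edges of length 0.2 and an internal edge of length 0.15, perturbed by less than $f/4$ with $f = 0.1$), and the dictionary-based `D` is an assumed representation of the distance matrix.

```python
import math

def is_split(pair1, pair2, D, f):
    """Sketch of IsSplit: four-point test on the estimated distance matrix D."""
    (v1, w1), (v2, w2) = pair1, pair2
    terms = [D[frozenset((w1, w2))], D[frozenset((v1, v2))],
             D[frozenset((w1, v1))], D[frozenset((w2, v2))]]
    if any(math.isinf(t) for t in terms):
        return True  # nu = +inf is not below f/2
    nu = 0.5 * (terms[0] + terms[1] - terms[2] - terms[3])
    return not (nu < f / 2)

f = 0.1
# True distances are 0.40 within a side and 0.55 across; each entry is off by < f/4.
D = {
    frozenset(("v1", "w1")): 0.41, frozenset(("v2", "w2")): 0.38,
    frozenset(("v1", "v2")): 0.57, frozenset(("w1", "w2")): 0.54,
    frozenset(("v1", "w2")): 0.57, frozenset(("w1", "v2")): 0.53,
}

correct = is_split(("v1", "w1"), ("v2", "w2"), D, f)
wrong_a = is_split(("v1", "w2"), ("v2", "w1"), D, f)
wrong_b = is_split(("v1", "v2"), ("w1", "w2"), D, f)
```

For the correct split the estimated internal length is about 0.16, above $f/2$; for the two incorrect pairings it is about 0.005 and about $-0.16$, so both are rejected.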



Collisions. As we discussed before, for routines DistanceEstimate (Figure 3.4) and IsSplit (Figure 4.1) to work, we need a configuration as in Figure 3.1 where two edge-disjoint subtrees are connected by a path that lies "above" them. However, for reasons that will be explained in later sections (see also the discussion in Section 3), in the course of the reconstruction algorithm we have to deal with configurations where two subtrees are not dangling, as in Figure 3.7. We use the following definitions. Recall the Basic Disjoint Setup (General) in Figure 3.7.



Definition 4.2 (Collisions) Suppose that $T_1$ and $T_2$ are legal subtrees of $T$ and suppose they are not dangling. We say that $T_1$ collides into $T_2$ at edge $e_2 = (u_2, v_2)$ ($u_2$ is the parent of $v_2$ in $T_2$), if the path

$$\mathrm{path}_T(\rho(T_1), \rho(T_2))$$

has non-empty intersection with edge $e_2$ (i.e., with $\mathrm{path}_T(u_2, v_2)$) but with no other edge in the subtree of $T_2$ rooted at $v_2$. See Figure 3.7. We sometimes say that the trees $T_1$, $T_2$ collide. We say that the collision is within distance $M$ if $d(T_1, T_2) < M$. (See Definition 3.14.)



An important step of the reconstruction algorithm is to detect collisions, at least when they are within a short distance. Routine IsCollision defined in Figure 4.3 and analyzed in Proposition 4.4 below performs this task. We consider the following configuration, which we call the "basic collision setup."

Definition 4.3 (Basic Collision Setup) We have two legal subtrees $T_{x_0}$ and $T_u$ rooted at $x_0$ and $u$ in $T$. We let $v, w$ be the children of $u$. We assume that we are in either of the configurations depicted in Figure 4.2, that is:

a. Either the path between $T_{x_0}$ and $T_u$ attaches in the middle of the edge $(u, v)$, in other words, $T_{x_0}$ collides into $T_u$ at edge $(u, v)$;

b. Or it goes through $u$, in other words, $T_{x_0}$ and $T_u$ are dangling.
a) with collision    b) without collision

Figure 4.2: Basic setup of the IsCollision routine in Figure 4.3. The subtrees $T_{x_0}$ and $T_u$ are edge-disjoint. The configuration on the left shows a collision, whereas the one on the right does not. Note that in the collision case, the internal path of the quartet $\{x_0, w, v_1, v_2\}$ is shorter than the edge $(u, v)$. This is the basic observation behind the routine IsCollision. We call $x_0$ the "reference point".



Algorithm IsCollision

Input: Four nodes $x_0$, $v$, $w$, $u$; an edge length $h$; a rooted forest and a distance matrix $(\mathcal{F}, \mathcal{D})$;
Output: TRUE or FALSE;

• [Children] Let $v_1$, $v_2$ be the children of $v$ in $\mathcal{F}$ (or $v_1 = v_2 = v$ if $v$ is a leaf);

• [Internal Edge Length] Set

$$\nu = \frac{1}{2}\left( \mathcal{D}(v_1, x_0) + \mathcal{D}(v_2, w) - \mathcal{D}(v_1, v_2) - \mathcal{D}(x_0, w) \right);$$

(set $\nu = +\infty$ if any of the distances is $+\infty$)

• [Test] If $(h - \nu) > f/2$ return TRUE, else return FALSE;

Figure 4.3: Routine IsCollision.



We call $x_0$ the "reference point".

The purpose of IsCollision is to distinguish between the two configurations above.

Proposition 4.4 (Correctness of IsCollision) Consider the Basic Collision Setup. In particular, assume that one of the two configurations in Figure 4.2 holds. Let $Q = \{x_0, v_1, v_2, w\}$ and let $\mathcal{D}$ and $h$ be the parameters passed to IsCollision in Figure 4.3. Assume that all edges in the tree satisfy $d(e) > f$. If



and 



f 

\d(x,y) -V(x,y)\ < \/x,y e Q, 

\h — d(u, v)\ < y, 



then the routine IsCollision returns TRUE if and only if $T_{x_0}$ collides into $T_u$ at edge $(u, v)$.

Proof: Whether or not there is a collision, the quantity $\nu$ computed in the routine is the length of the internal path of the quartet split $x_0 w | v_1 v_2$. If there is not a collision, then this path is actually equal to the path corresponding to the edge $(u, v)$. Otherwise, the length of the path is shorter than $d(u, v)$ by at least $f$ by assumption. (See Figure 4.2.) The proof follows. ■
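The observation in the proof can be illustrated with exact distances on a toy tree. The sketch below is an assumption-laden example: the edge lengths, the helper `make_metric`, and the choice $f = 0.1$ are made up, and exact distances are used in place of estimates.

```python
import math

def is_collision(x0, v1, v2, w, h, D, f):
    """Sketch of the IsCollision test; D(a, b) is an assumed distance callable."""
    needed = [D(v1, x0), D(v2, w), D(v1, v2), D(x0, w)]
    if any(math.isinf(t) for t in needed):
        return False  # nu = +inf, so h - nu cannot exceed f/2
    nu = 0.5 * (needed[0] + needed[1] - needed[2] - needed[3])
    return (h - nu) > f / 2

def make_metric(attach_from_v):
    """Toy tree: edge (u, v) has length 0.4, v has leaf children v1, v2 (0.2 each),
    u has child w (0.3). The path to x0 (length 0.5) attaches on edge (u, v)
    at distance attach_from_v from v; the value 0.4 means it attaches at u."""
    def D(a, b):
        if a == b:
            return 0.0
        pair = frozenset((a, b))
        if pair == frozenset(("v1", "v2")):
            return 0.4
        if pair == frozenset(("x0", "w")):
            return 0.5 + (0.4 - attach_from_v) + 0.3
        if "x0" in pair:  # x0 to v1 or v2
            return 0.5 + attach_from_v + 0.2
        return 0.2 + 0.4 + 0.3  # w to v1 or v2
    return D

h, f = 0.4, 0.1  # h plays the role of the estimated length of edge (u, v)
no_collision = is_collision("x0", "v1", "v2", "w", h, make_metric(0.4), f)
collision = is_collision("x0", "v1", "v2", "w", h, make_metric(0.15), f)
```

When the path attaches at $u$, the internal path of the quartet has the full length of $(u, v)$ and the test fails; when it attaches inside the edge, the internal path shrinks to 0.15 and the gap $h - \nu = 0.25$ exceeds $f/2$.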



5 Reconstruction Algorithm

We proceed with a formal description of the reconstruction algorithm. A detailed example can be found in Section 5.2. The reader may want to take a look at the example before reading the details of the algorithm.



5.1 Description of the Algorithm

Cherry Picking. Recall that in a 3-regular tree a cherry is a pair of leaves at graph distance 2. Roughly speaking, our reconstruction algorithm proceeds from a simple idea: it builds the tree one layer of cherries at a time. To see how this would work, imagine that we had access to a "cherry oracle," that is, a function $C(u, v, T)$ that returns the parent of the pair of leaves $\{u, v\}$ if the latter forms a cherry in the tree $T$ (and $\emptyset$, say, otherwise). Then, we could perform the following "cherry picking" algorithm:

• Currently undiscovered tree: $T' := T$;

• Repeat until $T'$ is empty,

  - For all $(u, v) \in \mathcal{L}(T') \times \mathcal{L}(T')$, if $w := C(u, v, T') \neq \emptyset$, set $\mathrm{Parent}(u) := \mathrm{Parent}(v) := w$;

  - Remove from $T'$ all cherries discovered at this step.
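The idealized scheme above can be sketched as follows. To keep the example short it makes simplifying assumptions that are not in the text: the true tree is given as a rooted binary tree through a hypothetical child-to-parent map `true_parent`, and the oracle is simulated by looking at that map directly (exactly what cannot be done from short sequences).

```python
from itertools import combinations

def cherry_picking(true_parent):
    """Recover a rooted binary tree layer by layer with a simulated cherry oracle.

    true_parent maps each non-root node to its parent (hypothetical input).
    """
    remaining = dict(true_parent)   # edges of the undiscovered tree T'
    recovered = {}

    def leaves():
        internal = set(remaining.values())
        return sorted(n for n in remaining if n not in internal)

    def oracle(u, v):
        # C(u, v, T'): the common parent if {u, v} is a cherry of T', else None.
        return remaining[u] if remaining[u] == remaining[v] else None

    while remaining:
        found = [(u, v, oracle(u, v)) for u, v in combinations(leaves(), 2)]
        found = [(u, v, w) for u, v, w in found if w is not None]
        if not found:
            break  # cannot happen for a full binary tree; kept as a safeguard
        for u, v, w in found:
            recovered[u] = recovered[v] = w
            # Removing the cherry turns its parent w into a leaf of T'.
            del remaining[u]
            del remaining[v]
    return recovered

balanced = {"a": "p", "b": "p", "c": "q", "d": "q", "p": "r", "q": "r"}
caterpillar = {"a": "p", "b": "p", "p": "q", "c": "q", "q": "r", "d": "r"}
```

On both toy trees, `cherry_picking` returns the original parent map; the balanced tree needs two rounds and the caterpillar three.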

Unfortunately, the cherry oracle cannot be simulated from short sequences at the leaves. Indeed, as we discussed in Section 3, short sequences provide only "local" metric information on the structure of the tree. See the example in Section 5.2 for an illustration of the problems that may arise. Nevertheless, the above scheme can be roughly followed by making a number of modifications which we now describe.

The high-level idea of the algorithm, which we call Blindfolded Cherry Picking (BCP), is to apply the cherry picking scheme above with two important differences:



• [Reconstructed Sequences] Leaf sequences provide only local metric information "around the leaves." To gain information about higher, internal nodes of the tree, we reconstruct sequences at the internal nodes of our partially reconstructed subforest and compute local metric information "around these nodes." By repeating this process, we gain information about higher and higher parts of the tree.

• [Fake Cherries] Moreover, because of the local nature of our information, some of the cherries we pick may in fact turn out not to be cherries, that is, they correspond to a path made of more than two edges in the true tree. (See Section 5.2 for an example.) As it turns out, this only becomes apparent once a larger fraction of the tree is reconstructed, at which point a subroutine detects the "fake" cherries and removes them.

The full algorithm is detailed in Figures 5.3, 5.4, 5.5 and 5.6. We now explain its main components. The parameters $k$, $R_{acc}$, and $\varepsilon$ will be set in a later section.
Figure 5.1: Adding a cherry $(x, z, y)$ to $\mathcal{F} \supseteq \{T_x, T_y\}$.

Adding a Cherry. The algorithm BCP maintains a partially reconstructed subforest of the true tree, or more precisely, a legal subforest $\mathcal{F}$ of $T$. The main operation we employ to "grow" our partially reconstructed subforest is the merging of two subtrees of $\mathcal{F}$ at their roots, an operation we call "adding a cherry," in reference to the cherry picking algorithm above. Suppose the current forest $\mathcal{F}$ contains two edge-disjoint legal subtrees $T_x$ and $T_y$. We merge them by creating a new node $z$ and adding the edges $(z, x)$ and $(z, y)$ as in Figure 5.1. We sometimes denote the pair of edges $\{(z, x), (z, y)\}$ by $(x, z, y)$. We call this operation adding cherry $(x, z, y)$ to $\mathcal{F}$.
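As a small illustration of this merging operation, the sketch below stores a rooted forest as a child-to-parent dictionary together with a set of roots. This representation and the function name `add_cherry` are assumptions made for the example, not the data structure of the paper.

```python
def add_cherry(parent, roots, x, y, z):
    """Merge the subtrees rooted at x and y under a new root z (in place)."""
    if x == y or x not in roots or y not in roots:
        raise ValueError("x and y must be distinct roots of the current forest")
    if z in roots or z in parent:
        raise ValueError("z must be a new node")
    # Add the edges (z, x) and (z, y), i.e. the cherry (x, z, y).
    parent[x] = z
    parent[y] = z
    roots.discard(x)
    roots.discard(y)
    roots.add(z)
    return parent, roots

parent, roots = {}, {"a", "b", "c"}      # three leaves, each its own subtree
add_cherry(parent, roots, "a", "b", "z")
```

After the call, `z` replaces `a` and `b` in the set of roots, and `c` is untouched.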

Identifying "Local" Cherries. As we explained above, we cannot hope to identify with certainty the cherries of the unexplored part of the tree from short sequences at the leaves. Instead, we settle for detecting what we refer to as "local" cherries, roughly, cherries in a "local" neighbourhood around the roots of the current reconstructed subforest. More precisely, a "local" cherry is a pair of roots of the current subforest that passes a series of tests as detailed below.

To determine whether two roots $v_1$, $w_1$ of the current forest form a "local" cherry, our routine LocalCherry² in Figure 5.4 performs three tests:



1. [Short Distance] The nodes $v_1$, $w_1$ are at a short distance (roughly $2g$);

2. [Local Cherry] For all pairs of roots $v_2$, $w_2$ at short distance (roughly $5g$), the quartet

$$Q = \{v_1, w_1, v_2, w_2\}$$

admits the split $v_1 w_1 | v_2 w_2$;

3. [Edge Lengths] The edges connecting $v_1$, $w_1$ to their (hypothetical) parent are short (roughly $g$).

The routine LocalCherry has three key properties, proved in Section 6:

1. [Edge Disjointness] It preserves the edge-disjointness of the current forest;

2. [Restricted Subforest] It builds a forest that is always a legal restriction of the true tree;

3. [Short Edges] It guarantees that all edges of the restricted forest are short (smaller than $g'$).

These properties are crucial for the proper operation of the algorithm (in particular for the correctness of routines such as DistortedMetric (Figure 3.8) and IsShort (Figure 3.6), as seen in Section 3).

² In [DMR06], the routine was called CherryID.

Finally, another important property of LocalCherry is that it is guaranteed to detect true cherries, at least those that are "locally witnessed." We now define this notion more precisely. For a distance matrix $\mathcal{D}$ and a set of nodes $\mathcal{N}$, we denote

$$\mathcal{D}(\mathcal{N}) = \max\left\{ \mathcal{D}(x, y) : \{x, y\} \subseteq \mathcal{N} \right\}.$$

Definition 5.1 (Witnessed Cherry) Let $\mathcal{F}$ be a forest with path metric $\mathcal{D}$. We say that a pair of leaves $\{u, v\}$ is an $M$-witnessed cherry in $\mathcal{F}$ if $\{u, v\}$ is a cherry in $\mathcal{F}$ and there are at least two other leaves $u'$, $v'$ s.t.

$$\mathcal{D}(Q) < M,$$

where $Q = \{u, v, u', v'\}$ (the leaves $u'$, $v'$ will act as "witnesses" of the cherry $\{u, v\}$).
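Definition 5.1 translates directly into a small check. In the sketch below, the predicate `is_cherry`, the callable `D`, and the toy distances are hypothetical stand-ins chosen for illustration.

```python
from itertools import combinations

def diameter(D, nodes):
    """D(N): the largest pairwise distance within the node set."""
    return max(D(x, y) for x, y in combinations(nodes, 2))

def is_witnessed_cherry(u, v, leaves, is_cherry, D, M):
    """True if {u, v} is a cherry with two other leaves u', v' such that D(Q) < M."""
    if not is_cherry(u, v):
        return False
    others = [x for x in leaves if x not in (u, v)]
    return any(diameter(D, (u, v, a, b)) < M for a, b in combinations(others, 2))

# Toy metric (made-up): a, b, c, d are close to each other, e is far from everything.
TABLE = {
    frozenset(("a", "b")): 0.4, frozenset(("c", "d")): 0.4,
    frozenset(("a", "c")): 0.7, frozenset(("a", "d")): 0.7,
    frozenset(("b", "c")): 0.7, frozenset(("b", "d")): 0.7,
}

def D(x, y):
    return 0.0 if x == y else TABLE.get(frozenset((x, y)), 5.0)

def is_cherry(u, v):
    return {u, v} == {"a", "b"}

leaves = ["a", "b", "c", "d", "e"]
```

With $M = 1$ the pair $\{a, b\}$ is witnessed by $c$ and $d$; with $M = 0.6$ no pair of witnesses is close enough.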



Detecting Collisions. The merging of subtrees through "local" cherries that are not actual cherries eventually results in collisions between subtrees of the current forest such as in Figure 3.7. Such configurations are undesirable since they do not allow us to complete the reconstruction by simply merging the subtrees at their roots. Therefore, we seek to detect these collisions using the routine DetectCollision³ in Figure 5.5.

After adding a new layer of "local" cherries, DetectCollision checks whether new collisions can be found. For this, we use routine IsCollision in Figure 4.3 from Section 4. We actually perform two such IsCollision tests for each candidate edge in the target subtree $T_{u_1}$ (see Figure 5.5). This is done for precisely the same reasons that we did a "multiple test" in routine DistortedMetric (Figure 3.8) in Section 3. It guarantees that at least one test is performed under the Basic Disjoint Setup (Dangling), which is needed for IsCollision to be correct. See Section 6 for details. Also, to make sure that we are always in either of the configurations in Figure 4.2, we scan the nodes of the target subtree $T_{u_1}$ in "reverse breadth-first search (BFS) order":

1. Order the vertices in subtree $T_{u_1}$ according to breadth-first search $v_1, \ldots, v_s$;

2. Scan the list in reverse order $v_s, \ldots, v_1$.

This way, we never find ourselves "above" a collision in $T_{u_1}$ (because otherwise we would have encountered the collision first).
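The scanning order can be sketched as follows, assuming the subtree is stored as a hypothetical `children` dictionary mapping each internal node to its list of children.

```python
from collections import deque

def reverse_bfs_order(children, root):
    """Nodes of the subtree below root, deepest BFS levels first."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, ()))  # leaves have no entry
    return order[::-1]

children = {"u1": ["a", "b"], "a": ["c", "d"]}
scan = reverse_bfs_order(children, "u1")
```

Here `scan` is `["d", "c", "b", "a", "u1"]`: every node is visited before its parent, so a collision on a lower edge is always met first. DetectCollision in Figure 5.5 skips the root itself, which corresponds to dropping the last entry.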



Removing Collisions. Once a collision is identified we remove it using the routine RemoveCollision⁴ in Figure 5.6. As seen in Figure 5.2, the routine essentially removes the path from the collision all the way to the root of the corresponding subtree. For a rooted forest $\mathcal{F}$, we use the notation $\mathrm{Parent}_{\mathcal{F}}(x)$ to indicate the unique parent of node $x$.

We note that we actually go through the "Collision Identification/Removal" step twice to make sure that we detect collisions occurring on edges adjacent to the roots. See the proof of Lemma 6.5.

³ In [DMR06], the routine was called FakeCherry.
⁴ In [DMR06], the routine was called Bubble.

Figure 5.2: The RemoveCollision routine in Figure 5.6 removes all white nodes and dashed edges. The dotted curvy line indicates the location of a collision to be removed.

We prove in Proposition 7.4 below that, at each iteration of BCP, at least one cherry is added that will not be removed at a later stage. Hence, the algorithm makes progress at every iteration and eventually recovers the full tree.



Implementation. We argue that the running time of the algorithm is $O(n^5 k)$ and that with the use of appropriate data structures it can be reduced further to $O(n^3 k)$. Let us start with the naive analysis. The distance estimations in the initialization step of the algorithm take overall time $O(n^2 k)$, since there are $O(n^2)$ pairs of leaves and each distance estimation between leaves takes linear time in the sequence length $k$. Now, in every iteration of the algorithm:

• The Local Cherry Identification step takes overall time $O(n^4 k)$, since we consider $O(n^2)$ pairs of roots in the for-loop of this step, and each call of LocalCherry requires $O(n^2 + nk)$ time: $O(n^2)$ time for the [Local Cherry] step and $O(nk)$ time for the [Edge Lengths] step, in which the IsShort call involves $O(1)$ ancestral sequence reconstructions on trees of size $O(n)$.

• The Collision Removal step requires $O(n^3 k)$ time in each iteration. Indeed, it performs $O(n^2)$ distance computations and each of these takes $O(nk)$ time, since it requires $O(1)$ ancestral sequence reconstructions on trees of size $O(n)$. It also performs $O(n^2)$ calls to DetectCollision and RemoveCollision, each of which is a linear time operation on trees of size $O(n)$.

Since each iteration requires $O(n^4 k)$ time and there are $O(n)$ iterations (see proof in Section 7), the overall running time is $O(n^5 k)$. The above analysis is wasteful in 1) not reusing the already reconstructed ancestral sequences and in 2) performing various tests on pairs of nodes that are far apart in the tree. With the use of appropriate data structures, we could keep track of the "local" neighborhood of each node and restrict many computations to these neighborhoods. This should bring down the running time to $O(n^3 k)$ with a constant that would depend explicitly on $f$, $g$. The details are left to the reader.
Algorithm Blindfolded Cherry Picking

Input: Samples at the leaves $\{\sigma_u\}_{u \in [n]}$;
Output: Estimated topology;

• 0) Initialization:

  - [Iteration Counter] $i := 0$;

  - [Rooted Subforest] $\mathcal{F}_0 := [n]$;

  - [Local Metric] For all $u, v \in \mathcal{F}_0$, set $\mathcal{D}_0(u, v) = \mathrm{DistanceEstimate}(u, v; \mathcal{F}_0; R_{acc})$;

• 1) Local Cherry Identification:

  - Iteration: $i$;

  - Set $\mathcal{F}_{i+1} := \mathcal{F}_i$;

  - For all pairs of roots $(v_1, w_1)$ of $\mathcal{F}_i$,

    * [Main Step] Compute (IsCherry, $l_v$, $l_w$) := LocalCherry$((v_1, w_1); (\mathcal{F}_i, \mathcal{D}_i))$;

    * If IsCherry = TRUE,

      · [Update Forest] Create new node $u_1$ and add cherry $(v_1, u_1, w_1)$ to $\mathcal{F}_{i+1}$;

      · [Edge Lengths] Set $h(u_1, v_1) := l_v$ and $h(u_1, w_1) := l_w$;

• 2) Collision Removal:

  - [Update Metric] For all $x_1, x_2 \in \rho(\mathcal{F}_{i+1})$, for all $u_\varphi \in T^{\mathcal{F}_{i+1}}_{x_\varphi}$, $\varphi = 1, 2$, set

    $$\mathcal{D}_{i+1}(u_1, u_2) = \mathrm{DistortedMetric}(u_1, u_2; \mathcal{F}_{i+1}; \{h(e)\}_{e \in \mathcal{F}_{i+1}}; R_{acc}; \varepsilon);$$

  - Set $\mathcal{F} := \mathcal{F}_{i+1}$ and $\mathcal{D} := \mathcal{D}_{i+1}$;

  - For all $(u_0, u_1) \in \rho(\mathcal{F}) \times \rho(\mathcal{F})$,

    * Set HasCollision := FALSE;

    * If $u_1$ is not a leaf,

      · [Main Step] Compute (HasCollision, $z$) := DetectCollision$((u_0, u_1); (\mathcal{F}, \mathcal{D}))$;

    * If HasCollision = TRUE,

      · [Update Forest] Compute $\mathcal{F}_{i+1}$ := RemoveCollision$(z; \mathcal{F}_{i+1})$;

  - [Second Pass] Set $\mathcal{F} := \mathcal{F}_{i+1}$ and repeat the previous step;

• 3) Termination:

  - If $|\rho(\mathcal{F}_{i+1})| \le 3$,

    * Join nodes in $\rho(\mathcal{F}_{i+1})$ (star if 3, single edge if 2);

    * Return (tree) $\mathcal{F}_{i+1}$;

  - Else, set $i := i + 1$, and go to Step 1.

Figure 5.3: Algorithm Blindfolded Cherry Picking.



5.2 Example 



We give a detailed example of the execution of the algorithm. Consider the tree depicted in Figure 5.7. It
is made of a large complete binary tree (on the right) with a small 3-level complete binary tree attached to 
its root (on the left). All edges have length g, except (v, x), (x, v') and the edges adjacent to the root of the 
small tree which have length g/2. In the figure, the subtree currently discovered by BCP is made of solid 
arrows and full circles. The remaining (undiscovered) tree is in dotted lines and empty circles. Assume 
Algorithm LocalCherry

Input: Two nodes $(v_1, w_1)$; current forest and distance matrix $(\mathcal{F}, \mathcal{D})$;
Output: Boolean value and length estimates;

• Set IsCherry := TRUE and $l_v = l_w = 0$;

• [Short Distance]

  - If $\mathcal{D}(v_1, w_1) > 2g + \varepsilon$, then IsCherry := FALSE;

• [Local Cherry]

  - Set $\mathcal{N} = \left\{ (v_2, w_2) \in \binom{\rho(\mathcal{F})}{2} : \mathcal{D}(\{v_1, w_1, v_2, w_2\}) < 5g + \varepsilon \right\}$;

  - If $\mathcal{N}$ is empty, then IsCherry := FALSE; Else, for all $(v_2, w_2) \in \mathcal{N}$,

    * If IsSplit$((v_1, w_1), (v_2, w_2); \mathcal{D})$ = FALSE then set IsCherry := FALSE and break;

• [Edge Lengths]

  - If IsCherry = TRUE,

    * Let $x_1$, $x_2$ be the children of $v_1$ in $\mathcal{F}$ (or let $x_1 = x_2 = v_1$ if $v_1$ is a leaf);

    * Let $z_0$ be the closest node to $v_1$ in $\rho(\mathcal{F}) - \{v_1, w_1\}$ under $\mathcal{D}$;

    * Set $(b_v, l_v)$ := IsShort$((x_1, x_2), (w_1, z_0); \mathcal{F}; R_{acc}; \varepsilon/16)$;

    * Repeat previous steps switching the roles of $v_1$ and $w_1$;

    * Set IsCherry := $b_v \wedge b_w$;

• Return (IsCherry, $l_v$, $l_w$);

Figure 5.4: Routine LocalCherry.



Algorithm DetectCollision

Input: Two roots $u_0$, $u_1$; directed forest and distance matrix $(\mathcal{F}, \mathcal{D})$;
Output: Boolean value and node;

• Set HasCollision := FALSE and $z := \emptyset$;

• Let $x_0$, $y_0$ be the children of $u_0$ in $\mathcal{F}$;

• Scan through all nodes $v$ in $T_{u_1}$ (except $u_1$) in reverse BFS manner,

  - Let $w := \mathrm{Sister}_{\mathcal{F}}(v)$ and $u := \mathrm{Parent}_{\mathcal{F}}(v)$;

  - [Collision Test] Compute

    $b_x$ := IsCollision$(x_0, v, w, u; h(u, v); (\mathcal{F}, \mathcal{D}))$;

    and

    $b_y$ := IsCollision$(y_0, v, w, u; h(u, v); (\mathcal{F}, \mathcal{D}))$;

  - If HasCollision := $b_x \wedge b_y$ = TRUE then set $z := v$ and break;

• Return (HasCollision, $z$);

Figure 5.5: Routine DetectCollision.



that the length of the sequences at the leaves allows us to estimate accurately distances up to 5g (the actual 
constants used by the algorithm can be found later). 
Algorithm RemoveCollision

Input: Node $v$; rooted forest $\mathcal{F}$;
Output: Rooted forest;

• If $v$ is not in $\mathcal{F}$ or $v$ is a root in $\mathcal{F}$, return $\mathcal{F}$;

• Let $z_0$ be the root of the subtree of $\mathcal{F}$ in which $v$ lies;

• Set $x := v$;

• While $x \neq z_0$,

- Set $x := \mathrm{Parent}_{\mathcal{F}}(x)$;

- Remove node $x$ and its adjacent edges below it;

• Return the updated forest $\mathcal{F}$.

Figure 5.6: Routine RemoveCollision.
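
A minimal Python sketch of Figure 5.6 follows, assuming the forest is stored as a child-to-parent dictionary in which roots map to `None` (this representation and the toy forest are assumptions made for the example). The ancestor chain is collected before anything is deleted, so that the parent of a removed node is never looked up.

```python
def remove_collision(v, parent):
    """Sketch of RemoveCollision on a forest stored as a child -> parent map."""
    # Nothing to do if v is absent or is already a root.
    if v not in parent or parent[v] is None:
        return dict(parent)
    # Walk from v up to the root z0, recording every strict ancestor.
    removed, x = set(), v
    while parent[x] is not None:
        x = parent[x]
        removed.add(x)
    # Delete those ancestors; their surviving children become new roots.
    return {node: (None if p in removed else p)
            for node, p in parent.items() if node not in removed}


# Toy forest (hypothetical): r -> (a, b), a -> (v, s), b -> (c1, c2).
forest = {"r": None, "a": "r", "b": "r", "v": "a", "s": "a", "c1": "b", "c2": "b"}
pruned = remove_collision("v", forest)
```

In the toy forest, removing the collision at v deletes a and r, which leaves v, s and b as roots while the cherry below b stays intact.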



First Level. We first join into cherries all pairs of leaves that "look" like $g$-cherries in this "local" metric. We are guaranteed to find all true $g$-cherries. However, consider pairs of leaves such as $u, v$ for which there is no "local evidence" that they do not form a cherry. Even though $u, v$ is not a cherry, it is joined into a cherry by BCP. Figure 5.7(a) depicts the current forest after the first iteration of BCP. Before proceeding further, we apply our estimator Anc from Section 2 to obtain reconstructed sequences at the roots of the current forest and recompute the "local" metric.



Removing a Fake Cherry. We subsequently proceed to join "local" $g$-cherries one layer at a time, reconstructing internal sequences as we do so. After many iterations, we find ourselves in the situation of Figure 5.7(b) where most of the large complete tree has been reconstructed (assume for now that edges $(u', v'')$, $(v', v'')$, $(u, v')$, $(v, v')$ represented in dashed lines are present). Now, the new information coming from sequences at $y_1, \ldots, y_4$ provides evidence that $(u, v', v)$ is not a cherry and that there is in fact a node $x$ on edge $(v, v')$. For example, the quartet $\{y_1, y_2, u, v\}$ suggests that $u, v$ forms a cherry with a $3g/2$-edge, which cannot hold in a true cherry. At this point, we remove the "fake" cherry $(u, v', v)$ as well as all cherries built upon it, here only $(u', v'', v')$. Note that we have removed parts of the tree that were in fact reconstructed correctly (e.g., the path between $u$ and $u'$).



Rediscovering Removed Parts. Subsequently, BCP continues to join "local" cherries and "rediscovers" the parts of the tree that were removed. For instance, in Figure 5.7(c) the edge $(u', v'')$ is reconstructed again but this time it forms a cherry with $(u'', v'')$ rather than $(v', v'')$.



Final Step. Eventually, the full tree is correctly reconstructed except maybe for a few (at most 3) remaining edges. Those can be added separately. For example in Figure 5.7(d) only the three edges around $x$ remain to be uncovered. Note that the reconstructed tree has a root which is different from that of the original tree.



6 Analysis I: Induction Step 

In this section and the next, we establish that BCP reconstructs the phylogeny correctly. In this section, we establish a number of combinatorial properties of the current forest $\mathcal{F}_i$ grown by BCP. Then, in the next section, we prove that the "correctly reconstructed subforest" of $\mathcal{F}_i$ increases in size at every iteration.
Figure 5.7: Illustration of BCP's unraveling. Panels: a) first level of local cherries; b) BCP removes a fake cherry; c) with the extra information, BCP rediscovers part of the tree; d) only three extra edges need to be added.
Parameters. Let $\delta > 0$ be the error tolerance. Each application of Proposition 3.15 has an error of $O(n^{-\gamma})$. We will do $O(n^3)$ distance estimations so that by the union bound we require $O(n^{3-\gamma}) \leq \delta$. Let
$\varepsilon$ < | $\min\{f, g' - g\}$ and $R_{\mathrm{col}} > 6g$. Set $M > R_{\mathrm{col}} + 4g'$; take $R_{\mathrm{acc}} > M + 2B(g') + 4g'$; and choose the sequence length to be equal to the maximum sequence length requirement for Proposition 3.13 with parameters $\varepsilon$, $M$ and $R_{\mathrm{acc}}$, for Proposition 3.2 with parameters $\varepsilon$ and $M$, and for Proposition 3.15 with parameters $\varepsilon/16$, $M$ and $R_{\mathrm{acc}}$.
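
As a rough check of the union bound used in the Parameters paragraph, suppose the number of distance estimations is at most $C_1 n^3$ and each application fails with probability at most $C_2 n^{-\gamma}$. The constants $C_1$, $C_2$ are hypothetical names introduced here for illustration; they do not appear in the text.

```latex
\Pr[\text{some estimate fails}]
  \;\le\; \sum_{j=1}^{C_1 n^3} \Pr[\text{estimate } j \text{ fails}]
  \;\le\; C_1 n^{3} \cdot C_2 n^{-\gamma}
  \;=\; C_1 C_2 \, n^{3-\gamma}.
```

Under this reading, the requirement that the total failure probability be at most the error tolerance can be met for all large $n$ as soon as $\gamma > 3$.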



Induction Step. The following proposition establishes a number of properties of the forest grown by BCP. We assume here that the conclusion of Proposition 3.15 holds at every iteration of the algorithm, until the tree is fully recovered. In the next section, we will prove that the latter is indeed true with high probability.

Proposition 6.1 (Properties of $\mathcal{F}_i$) Denote by $\mathcal{F}_i = \{T_u : u \in \rho(\mathcal{F}_i)\}$ the current forest at the beginning of BCP's $i$-th iteration. Also, assume that the conclusion of Proposition 3.15 holds for each call of the routine DistortedMetric throughout the execution of the algorithm. Then, $\forall i \geq 0$ (until the algorithm stops),

1. [Legal Subforest] $\mathcal{F}_i$ is a legal subforest of $T$;

2. [Edge Lengths] $\forall u \in \rho(\mathcal{F}_i)$, $T_u$ has edge lengths at most $g'$;

3. [Weight Estimation] The estimated lengths $\{h(e)\}_{e \in \mathcal{E}(\mathcal{F}_i)}$ of the edges in $\mathcal{F}_i$ are within $\varepsilon/16$ of their right values;

4. [Collisions] There is no collision within distance $R_{\mathrm{col}}$. (See Definition 4.2.)
Proof of Proposition 6.1: The proof is by induction on $i$.

$i = 0$: The set $\rho(\mathcal{F}_0)$ consists of the leaves of $T$. The claims are therefore trivially true.

$i > 0$: Assume the claims are true at the beginning of the $j$-th iteration for all $j \leq i$. By doing a step-by-step analysis of the $i$-th iteration, we show that the claims are still true at the beginning of the $(i+1)$-st iteration.

We first analyze the routine LocalCherry (Figure 5.4). For a legal subforest $\mathcal{F}$ of $T$, we denote the "remaining" forest by

$\tilde{\mathcal{F}} = T - \mathcal{F}.$

More precisely, if $\mathcal{F} = \{T_1, \ldots, T_\alpha\}$ then $\tilde{\mathcal{F}}$ is the forest obtained from $T$ as follows:

1. Remove all edges of $T$ in the union of the trees $T_1, \ldots, T_\alpha$. In particular, for those edges of a $T_a$ representing a path in $T$, we remove all corresponding edges of $T$.

2. The nodes of $\tilde{\mathcal{F}}$ are all the endpoints of the remaining edges of $T$. All other nodes of $T$ are discarded.

Note that the set $\tilde{\mathcal{F}}$ is in fact a subforest of $T$.
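
The two-step construction of the remaining forest can be sketched in Python as follows. The encoding is an assumption made for the example: the tree is a list of edges, and each edge of the reconstructed forest is given by the path of tree nodes it represents; the toy quartet is likewise invented.

```python
def remaining_forest(tree_edges, forest_paths):
    """Return (edges, nodes) of the remaining forest obtained from T and F."""
    # Step 1: drop every edge of T that lies on a path represented in F.
    used = set()
    for path in forest_paths:
        used.update(frozenset(edge) for edge in zip(path, path[1:]))
    edges = {frozenset(edge) for edge in tree_edges} - used
    # Step 2: keep only the endpoints of the surviving edges.
    nodes = set().union(*edges)
    return edges, nodes


# Toy tree (hypothetical): quartet ab|cd with internal nodes p and q.
tree_edges = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
forest_paths = [["p", "a"], ["p", "b"]]      # the cherry (a, p, b) is built
edges, nodes = remaining_forest(tree_edges, forest_paths)
```

After the cherry (a, p, b) has been reconstructed, the remaining forest consists of the three edges around q, and the root p of the cherry appears in it as a leaf.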

Lemma 6.2 (Local Cherry Identification) Let $\mathcal{F}_i$ be the current forest at the beginning of the $i$-th iteration. Then we have the following:

• If $\{v_1, w_1\}$ is a $5g$-witnessed cherry in $\tilde{\mathcal{F}}_i$, then it passes all tests in LocalCherry;

• If $\{v_1, w_1\}$ passes all the tests in LocalCherry, then

$d(u_1, v_1) < g + 2\varepsilon,$

and

$d(u_1, w_1) < g + 2\varepsilon,$

where $u_1$ is the parent of $\{v_1, w_1\}$ (as defined by LocalCherry and the [Update Forest] step in the main loop). Moreover the length estimates satisfy

$|d(u_1, v_1) - l_v| < \frac{\varepsilon}{16}$

and

$|d(u_1, w_1) - l_w| < \frac{\varepsilon}{16}.$

• If two pairs $\{v_1, w_1\}$ and $\{v_2, w_2\}$ both pass all tests in LocalCherry, then it must be that the paths $\mathrm{path}_T(v_1, w_1)$ and $\mathrm{path}_T(v_2, w_2)$ are non-intersecting.

Proof: Suppose first that $\{v_1, w_1\}$ is a $5g$-witnessed cherry in $\tilde{\mathcal{F}}_i$ with witness $\{v_2, w_2\}$. Since there is no collision within $R_{\mathrm{col}} > 5g$ by assumption, we are in the dangling case. Moreover, all edge weights below $\{v_1, w_1, v_2, w_2\}$ have been estimated within $\varepsilon/16$ and all corresponding edge weights (possibly corresponding to paths) are $\leq g'$. Therefore, by Propositions 3.15 and 4.1, DistortedMetric (Figure 3.8) is accurate within $\varepsilon$, IsSplit (Figure 4.1) returns the correct answer and IsShort (Figure 3.6) is accurate within $\varepsilon/16$. Hence, $\{v_1, w_1\}$ passes the three tests in LocalCherry.

Conversely, suppose $\{v_1, w_1\}$ passes all tests in LocalCherry. By our assumptions, when DistortedMetric returns a finite value, it is accurate within $\varepsilon$. Let $z_0$ be as in Figure 5.4 (that is, $z_0$ is the closest node to $v_1$ in $\rho(\mathcal{F}) - \{v_1, w_1\}$ under the distorted metric). The fact that $\{v_1, w_1\}$ previously passed the [Local Cherry] test implies in particular that the diameter of $\{v_1, w_1, z_0\}$ must be less than $5g + 2\varepsilon$. In particular, there is no collision between the subtrees rooted at $v_1$, $w_1$, and $z_0$ since $R_{\mathrm{col}} > 5g$. Let $u_1$ be the node at which the paths joining $v_1$, $w_1$ and $z_0$ intersect. Then, IsShort returns an estimate within $\varepsilon/16 < \varepsilon$ which in turn implies that $d(u_1, v_1)$ and $d(u_1, w_1)$ are $< g + 2\varepsilon$.

For the third part, assume by contradiction that the paths $\mathrm{path}_T(v_1, w_1)$ and $\mathrm{path}_T(v_2, w_2)$ intersect. Then by the triangle inequality, the diameter of $\{v_1, w_1, v_2, w_2\}$ is at most $5g$. In particular, when LocalCherry is applied to $\{v_1, w_1\}$, the pair $\{v_2, w_2\}$ is considered in the [Local Cherry] test and, since there is no collision within $R_{\mathrm{col}} > 5g$, IsSplit correctly returns FALSE. That is a contradiction. ■



We can now prove Claims 1, 2 and 3 of Proposition 6.1.

Lemma 6.3 (Claims 1, 2 and 3) Let $\mathcal{F}_{i+1}$ be the current forest at the beginning of the $(i+1)$-st iteration. Then Claims 1, 2 and 3 of the induction hypothesis hold for $\mathcal{F}_{i+1}$.

Proof: Since RemoveCollision (Figure 5.6) only removes edges from the current forest, it is enough to prove that after the completion of the Local Cherry Identification step the resulting forest satisfies Claims 1, 2 and 3.

Claim 1: By the induction hypothesis, $\mathcal{F}_i$ is legal. We only need to check that $\mathcal{F}_{i+1}$ is edge-disjoint. Suppose on the contrary that the forest is not edge-disjoint. Also, suppose that, along the execution of the Local Cherry Identification step, the forest stops being edge-disjoint when cherry $(v_1, u_1, w_1)$ is added to $\mathcal{F}_i$. Then one of the following must be true:

1. There is an "old" root $z \in \rho(\mathcal{F}_i)$ such that $\mathrm{path}_T(v_1, w_1)$ is edge-sharing with $T_z$. But then there is a collision in $\mathcal{F}_i$ within distance $2g + 2\varepsilon$ which contradicts the induction hypothesis (Claim 4).

2. There is a "new" root $z \in \rho(\mathcal{F}_{i+1}) \setminus \rho(\mathcal{F}_i)$ with corresponding cherry $(x, z, y)$ such that $\mathrm{path}_T(v_1, w_1)$ is edge-sharing with $\mathrm{path}_T(x, y)$. In that case $v_1 w_1 | x y$ is not the correct split and, by the triangle inequality and Proposition 3.15, the pair $\{x, y\}$ is close enough to $\{v_1, w_1\}$ to be considered in the [Local Cherry] test. But then, by Lemma 6.2, LocalCherry (Figure 5.4) rejects $\{v_1, w_1\}$ when performing the [Local Cherry] test with witness $\{x, y\}$, a contradiction.

Claims 2 and 3: These follow from the induction hypothesis and Lemma 6.2. ■



It remains to prove Claim 4 of Proposition 6.1. This follows from the following analysis of DetectCollision (Figure 5.5). Note that, since Claim 4 holds for all iterations $j \leq i$, it must be the case that any new collision between two trees involves at least one of the new edges of these trees added in the Local Cherry Identification step. We call an edge deep if it is not adjacent to a root in the current forest. Otherwise we call the edge a top edge. We show that the first pass of the Collision Removal step removes all collisions into deep edges. At the beginning of the second pass, all collisions (if any) must be into top edges. We show that the second pass cleans those up. We first prove that there are no false positives in the Collision Removal step.

Lemma 6.4 (Collision Removal: No False Positive) Let $\mathcal{F}$ be the current forest at the beginning of the first or second pass of the Collision Removal step of the $i$-th iteration (that is, $\mathcal{F}_{i+1,1}$ or $\mathcal{F}_{i+1,2}$), and let $u_0, u_1 \in \rho(\mathcal{F})$ be the roots of the trees $T_0 = T_{u_0}^{\mathcal{F}}$ and $T_1 = T_{u_1}^{\mathcal{F}}$. Let $v$ be a node in $T_1$, and suppose that $T_0$ does not collide into $T_1$ below $v$ or on the edge immediately above it. Then no collision is detected in the corresponding step of DetectCollision.

Proof: Let $x_0, y_0$ be the children of $u_0$. It suffices to show that either $b_x$ is FALSE or $b_y$ is FALSE (see Figure 5.5). Without loss of generality we can assume that the path connecting $u_0$ to $u_1$ does not pass through $x_0$. In particular, we are in case b) of Figure 4.2. If any of the distances passed to IsCollision is $+\infty$, IsCollision returns FALSE. Otherwise, by our assumption on the output of DistortedMetric, the assumptions of Proposition 4.4 are satisfied. Hence, IsCollision returns FALSE in that case as well. ■



Lemma 6.5 (Collision Removal) The first and second passes of the Collision Removal step satisfy:

1. Let $\mathcal{F}_{i+1,1}$ be the current forest at the beginning of the first pass of the Collision Removal step of the $i$-th iteration, and let $u_0, u_1 \in \rho(\mathcal{F}_{i+1,1})$. Suppose $T_0 = T_{u_0}^{\mathcal{F}_{i+1,1}}$ collides into $T_1 = T_{u_1}^{\mathcal{F}_{i+1,1}}$ within distance $R_{\mathrm{col}}$ on a deep edge $e = (u, v)$ of $T_1$. Then DetectCollision in Figure 5.5 correctly detects the collision.

2. Let $\mathcal{F}_{i+1,2}$ be the current forest at the beginning of the second pass of the Collision Removal step of the $i$-th iteration, and let $u_0, u_1 \in \rho(\mathcal{F}_{i+1,2})$. Suppose $T_0 = T_{u_0}^{\mathcal{F}_{i+1,2}}$ collides into $T_1 = T_{u_1}^{\mathcal{F}_{i+1,2}}$ within distance $R_{\mathrm{col}}$. Then DetectCollision correctly detects the collision.



Proof: 1. Denote by $x_\varphi, y_\varphi$ the children of $u_\varphi$, $\varphi = 0, 1$. Consider the Basic Collision Setup of Section 4. By the remark above the statement of the lemma, the path coming from $e$ enters $T_0$ through a top edge, or at $u_0$. (See Figure 6.1.) Let

$A_{0 \to 1} = \{z \in V(T_1) : e \text{ is not in the subtree of } T_1 \text{ rooted at } z\},$

and

$B_{0 \to 1} = V(T_1) \setminus A_{0 \to 1}.$

Denote by $z$ the current-node variable used by DetectCollision as it scans the tree $T_1$ in a reverse BFS manner. Observe that, for all $z \in A_{0 \to 1} - \{v\}$, the Basic Collision Setup of Proposition 4.4 holds for both $x_0$ and $y_0$. Hence, by Lemma 6.4, IsCollision in Figure 4.3 returns the correct answer. Furthermore, in the case of $v$, IsCollision returns TRUE for both $x_0$ and $y_0$ since the collision is within $R_{\mathrm{col}}$ and

$d(\{x_0, y_0, v_1, v_2, w\}) < R_{\mathrm{col}} + 4g' < M,$

by Claim 2. For $w$, IsCollision returns FALSE because in that case $h(u, w) - \nu < -f + 3\varepsilon < f/2$, where $\nu$ is the estimated length of the internal edge of $\{x_0, w_1, w_2, v\}$ or $\{y_0, w_1, w_2, v\}$ with $w_1, w_2$ the children of $w$. Finally, by scanning the nodes in reverse BFS order, we guarantee that $v$ and $w$ are encountered before any node in $B_{0 \to 1}$. Hence, DetectCollision identifies correctly the collision on edge $e$.

2. Observe that, after a call to RemoveCollision, the set of edges in the remaining forest is a subset of what it used to be. In particular, the subset of these edges involved in a collision decreases in size. Moreover, from the first part of the lemma, at the end of the first pass there is no collision remaining into deep edges. Given this, the argument above implies that any remaining collision within distance $R_{\mathrm{col}}$ will be found and removed in the second pass. ■



7 Analysis II: Tying It All Together 



In the previous section, we showed that provided Proposition 3.15 holds at every iteration the forest built is "well-behaved." Below, we finish the proof of our main theorem by showing that Proposition 3.15 indeed holds until termination and that the algorithm eventually converges. We also discuss the issues involved in extending $O(\log n)$-reconstruction beyond the $\Delta$-Branch Model in Section 7.3.
7.1 Quantifying Progress

Our main tool to assess the progress of the algorithm is the notion of a fixed subforest: in words, a subforest of the current forest that will not be modified again under the normal operation of the algorithm.

Definition 7.1 (Fixed Subforest) Let $\mathcal{F}$ be a legal subforest of $T$. Let $u \in V(\mathcal{F})$. We say that $u$ is fixed if $T_u^{\mathcal{F}}$ is fully reconstructed, or in other words, if $T_u^{\mathcal{F}}$ can be obtained from $T$ by removing (at most) one edge adjacent to $u$ and keeping the subtree containing $u$. Note that descendants of a fixed node are also fixed. We denote by $\mathcal{F}^*$ the rooted subforest of $\mathcal{F}$ made of all fixed nodes of $\mathcal{F}$. We say that $\mathcal{F}^*$ is the maximal fixed subforest of $\mathcal{F}$.

Let $\mathcal{F}_i = \{T_1, \ldots, T_\alpha\}$ be the current forest at iteration $i$ with remaining forest $\tilde{\mathcal{F}}_i = \{\tilde{T}_1, \ldots, \tilde{T}_\beta\}$. Assuming the conclusion of Proposition 6.1 holds, each leaf $v$ in $\tilde{\mathcal{F}}_i$ satisfies exactly one of the following:

• Fixed Root: $v$ is a root of a fully reconstructed tree $T_a \in \mathcal{F}_i$ (that is, $T_a$ is in $\mathcal{F}_i^*$);

• Colliding Root: $v$ is a root of a tree $T_a \in \mathcal{F}_i$ that contains a collision (that is, $T_a$ is not in $\mathcal{F}_i^*$);

• Collision Node: $v$ belongs to a path connecting two vertices in $T_a \in \mathcal{F}_i$ but is not the root of $T_a$ (that is, it lies in the "middle" of an edge of $T_a$).

Note in particular that the fixed roots of $\mathcal{F}_i$ are roots in $\mathcal{F}_i^*$ (although not all roots of the maximal fixed subforest $\mathcal{F}_i^*$ are fixed roots in $\mathcal{F}_i$ as they may lie below a colliding root). We also need the notion of a fixed bundle: in words, a cherry (along with witnesses) that will be picked by the algorithm at the next iteration and remain until termination.

Definition 7.2 (Fixed Bundle) A bundle in $\tilde{\mathcal{F}}_i$ is a group of four leaves in $\tilde{\mathcal{F}}_i$ such that:

• Any two leaves in the bundle are at topological distance at most 5 in $\tilde{\mathcal{F}}_i$;

• It includes at least one cherry of $\tilde{\mathcal{F}}_i$.

A fixed bundle is a bundle in $\tilde{\mathcal{F}}_i$ whose leaves are fixed roots.

The following lemma is the key to our convergence argument. It ensures that a fixed bundle always exists 
and hence that progress is made at every iteration. 

Proposition 7.3 (Existence of a Fixed Bundle) Assume that $\mathcal{F}_i$ satisfies the conclusion of Proposition 6.1 and that $\tilde{\mathcal{F}}_i$ has at least 4 leaves. Then, $\tilde{\mathcal{F}}_i$ contains at least one fixed bundle.



Proof: To avoid confusion between the forests $\mathcal{F}_i$ and $\tilde{\mathcal{F}}_i$, we will refer to the forest $\tilde{\mathcal{F}}_i$ as the anti-forest, to its trees as anti-trees and to its leaves as anti-leaves. We first make a few observations:

1. The anti-forest $\tilde{\mathcal{F}}_i$ is binary, that is, all its nodes have degrees in $\{1, 3\}$. Indeed, note first that since $T$ is binary, one cannot obtain nodes of degree higher than 3 by removing edges from $T$. Assume by contradiction that there is a node, $u$, of degree 2 in $\tilde{\mathcal{F}}_i$. Let $w_1, w_2, w_3$ be the neighbors of $u$ in $T$ and assume that $w_3 \notin \tilde{\mathcal{F}}_i$. By construction of the anti-forest $\tilde{\mathcal{F}}_i$ the edge $(u, w_3)$ is in the forest $\mathcal{F}_i$ (possibly included in an edge of $\mathcal{F}_i$ corresponding to a path of $T$). Moreover, $u$ is a node of degree 1 in $\mathcal{F}_i$. But this is a contradiction because the only nodes of degree 1 in a legal subforest of $T$ are the leaves of $T$ and $u$ cannot be a leaf of $T$.

2. A binary tree (or anti-tree) $T_0$ with 4 or more leaves (or anti-leaves) contains at least one bundle. Indeed, let $L_0$ be the leaves of $T_0$. We call $R_0 = L_0$ the level-0 leaves of $T_0$. Now remove all cherries from $T_0$ and let $R_1$ be the roots of the removed cherries, the level-1 leaves. Denote by $T_1$ the tree so obtained and note that its leaves $L_1$ contain $R_1$ as well as some remaining level-0 leaves. Consider the cherries of $T_1$ (there is at least one). If any such cherry is made of two level-1 leaves, then the corresponding (descendant) level-0 leaves of $T_0$ form a bundle in $T_0$ and we are done. (In that case, the diameter of the bundle is 4.) Suppose there is no such cherry. Note that there is no cherry in $T_1$ formed from two level-0 leaves as those were removed in constructing $T_1$. Hence, all remaining cherries of $T_1$ must contain at least one level-1 leaf. Now, merge all such cherries to obtain $T_2$. Denote by $R_2$ the roots of the removed cherries, the level-2 leaves. Once again, by the argument above, any cherry of $T_2$ contains at least one level-2 leaf. Any such cherry (there is at least one) provides a bundle by considering its descendant level-0 leaves in $T_0$. (If the second leaf in the cherry is level-0, the diameter of the bundle is 4. If it is level-1 or level-2, the diameter is 5.) This proves the claim.

3. By Claim 4 in Proposition 6.1 there is no collision within $R_{\mathrm{col}} > 6g$ (see Definition 4.2). In particular, collision nodes are at distance at least $R_{\mathrm{col}} > 6g$ from any other anti-leaf (collision nodes, fixed roots, colliding roots) in $\tilde{\mathcal{F}}_i$. Therefore, if an anti-tree in $\tilde{\mathcal{F}}_i$ contains a collision node, then it has $\geq 4$ anti-leaves and, from the previous observation, it contains at least one bundle. Moreover, this bundle cannot contain a collision node since in a bundle all anti-leaves are at distance at most $5g$ and collision nodes are at distance at least $R_{\mathrm{col}} > 6g$ from all other anti-leaves in $\tilde{\mathcal{F}}_i$.

4. From the previous observations, we get the following: if a tree in $\tilde{\mathcal{F}}_i$ contains a collision, then either it has a fixed bundle or it has at least one colliding root.
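
Observation 2 can be checked on small examples with a brute-force search. The sketch below does not follow the level-peeling argument of the proof; it simply enumerates all groups of four leaves and tests the two conditions of Definition 7.2 using hop counts, under the assumption that a cherry is a pair of leaves at topological distance 2. The adjacency encoding and the caterpillar example are hypothetical.

```python
from collections import deque
from itertools import combinations

def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hop_distances(adj, source):
    # Plain BFS giving the topological distance from source to every node.
    dist, queue = {source: 0}, deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def find_bundle(adj, max_dist=5):
    """Return four leaves forming a bundle, or None if there is none."""
    leaves = sorted(n for n in adj if len(adj[n]) == 1)
    dist = {leaf: hop_distances(adj, leaf) for leaf in leaves}
    for quad in combinations(leaves, 4):
        pairs = list(combinations(quad, 2))
        close = all(dist[a][b] <= max_dist for a, b in pairs)
        has_cherry = any(dist[a][b] == 2 for a, b in pairs)
        if close and has_cherry:
            return quad
    return None


# Caterpillar with six leaves: cherries (a, b) and (e, f) at the two ends.
caterpillar = adjacency([
    ("a", "i1"), ("b", "i1"), ("i1", "i2"), ("c", "i2"), ("i2", "i3"),
    ("d", "i3"), ("i3", "i4"), ("e", "i4"), ("f", "i4")])
bundle = find_bundle(caterpillar)
```

On the caterpillar, the leaves a, b, c, d form a bundle: they contain the cherry (a, b) and their largest pairwise topological distance is 4.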

We now proceed with the proof. Assume first that there is no collision node in $\tilde{\mathcal{F}}_i$. Then, there cannot be any colliding root either because by definition a colliding root is the root of a tree containing a collision. In particular, $\tilde{\mathcal{F}}_i$ is composed of a single anti-tree whose anti-leaves are all fixed roots. Then, since by assumption $\tilde{\mathcal{F}}_i$ has at least 4 anti-leaves, there is a fixed bundle by Observation 2 above.

Assume instead that there is a collision node. Let $\tilde{T}^0$ be an anti-tree in $\tilde{\mathcal{F}}_i$ with such a collision node, say $c^0$. Then by Observation 4 either 1) $\tilde{T}^0$ has a fixed bundle in which case we are done, or 2) one of $\tilde{T}^0$'s anti-leaves is a colliding root, say $r^0$. In the latter case, let $T^0$ be the tree in $\mathcal{F}_i$ whose root is $r^0$. The tree $T^0$ contains at least one collision node (included in an edge corresponding to a path of $T$). This collision node, say $c^1$, is also contained in an anti-tree in $\tilde{\mathcal{F}}_i$, say $\tilde{T}^1$. Repeat the argument above on $\tilde{T}^1$, and so on.

We claim that this process generates a simple path $P$ in $T$ starting from the node $c^0$ above, passing through an alternating sequence of colliding roots and collision nodes $r^0, c^1, r^1, c^2, \ldots$, and eventually reaching a fixed bundle. Indeed, since there is no cycle in $T$, the only way for $P$ not to be simple is for it to "reverse on itself." But note that by definition, for all $j$ we have $c^j \neq r^j$ and $r^j \neq c^{j+1}$. Moreover the simple paths $c^j \to r^j$ and $r^j \to c^{j+1}$ belong respectively to the anti-forest $\tilde{\mathcal{F}}_i$ and the forest $\mathcal{F}_i$ (possibly with subpaths collapsed into edges). In particular, their edges (in $T$) cannot intersect. Hence, $P$ is simple. Finally, since $T$ is finite, this path cannot be infinite, and we eventually find a fixed bundle. ■



7.2 Proof of the Main Theorem 

Consider now the $\Delta$-Branch Model. The proof of convergence works by considering first the hypothetical case where all distance estimates computed by the algorithm are perfectly accurate, that is the case where we have "perfect" local information. We denote by $\{(\mathcal{F}_0)_i\}_{i \geq 0}$ the sequence of forests built under this assumption. Note that, up to arbitrary choices (tie breakings, orderings, etc.), this sequence is deterministic. We now show that it terminates in a polynomial number of steps with the correct tree.
Proposition 7.4 (Progress Under Perfect Local Information) Assume $(\mathcal{F}_0)_i = \{T_u : u \in \rho((\mathcal{F}_0)_i)\}$ is the current forest at the beginning of BCP's $i$-th iteration under perfect local information with corresponding maximal fixed subforest $(\mathcal{F}_0)_i^*$. Then for all $i \geq 0$ (before the termination step), $(\mathcal{F}_0)_i^* \subseteq (\mathcal{F}_0)_{i+1}^*$ and $|V((\mathcal{F}_0)_{i+1}^*)| > |V((\mathcal{F}_0)_i^*)|$.

Proof of Proposition 7.4: We first argue that $(\mathcal{F}_0)_i^* \subseteq (\mathcal{F}_0)_{i+1}^*$. Note that the only routine that removes edges is RemoveCollision in Figure 5.6. Since RemoveCollision only removes edges above identified collisions and $(\mathcal{F}_0)_i^*$ is fully reconstructed, it suffices to show that collisions identified by DetectCollision in Figure 5.5 are actual collisions. This follows from Lemma 6.4.

To prove $|V((\mathcal{F}_0)_{i+1}^*)| > |V((\mathcal{F}_0)_i^*)|$, assume $(\mathcal{F}_0)_i = \{T_1, \ldots, T_\alpha\}$ with remaining forest $(\tilde{\mathcal{F}}_0)_i = \{\tilde{T}_1, \ldots, \tilde{T}_\beta\}$. From Proposition 7.3 it follows that $(\tilde{\mathcal{F}}_0)_i$ contains at least one fixed bundle. This immediately implies the second claim. Indeed, by Lemma 6.2, note that the cherry in the fixed bundle is found by LocalCherry in Figure 5.4 during the $(i+1)$-st iteration and is not removed by the Collision Removal step from the argument above. ■

Proof of Theorem 1: To prove our main theorem under the $\Delta$-Branch Model, we modify our reconstruction algorithm slightly by rounding the estimates in Proposition 3.15 to the closest multiple of $\Delta$. Also, we choose a number of samples large enough so that the distance estimation error is smaller than $\Delta/3$. In that case, we simply mimic the algorithm under perfect local information. Note that by Proposition 7.4 there are at most $O(n)$ iterations until termination under perfect local information. By a union bound, it follows that Proposition 3.15 holds true for all pairs of subtrees in $\{(\mathcal{F}_0)_i\}_{i \geq 0}$ with high probability.
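
The rounding step can be illustrated with a small Python sketch. It assumes that, under the discretization, the true value being estimated is an integer multiple of $\Delta$; the helper name `snap_to_grid` and the numeric values are hypothetical. With an estimation error below $\Delta/3$ (in particular below $\Delta/2$), rounding to the closest multiple recovers the true multiple exactly, which is why the algorithm can then mimic the run under perfect local information.

```python
def snap_to_grid(estimate, delta):
    """Return the integer k such that k * delta is the multiple closest to estimate."""
    return round(estimate / delta)


delta = 0.05
# Perturb each true value k * delta by an error of magnitude 0.3 * delta < delta / 3.
recovered = [snap_to_grid(k * delta + err * delta, delta)
             for k in range(1, 20) for err in (-0.3, 0.0, 0.3)]
expected = [k for k in range(1, 20) for _ in range(3)]
```

Every perturbed value snaps back to its original multiple, so `recovered` and `expected` coincide.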



We can now conclude the proof. By Proposition 6.1, the current forest at each iteration is correctly reconstructed. By Proposition 7.4, after $O(n)$ iterations there remain at most three nodes in $\rho(\mathcal{F})$ and at that point, from Proposition 6.1, Claim 4, we have that $\mathcal{F} = \mathcal{F}^*$. Therefore the remaining task is to join the remaining roots and there is only one possible topology. So when the BCP algorithm terminates, it outputs the tree $T$ (as an undirected tree) with high probability.

The tightness of the value $g^*$ is justified by the polynomial lower bound [Mos04] on the number of required characters if the mutation probability $p$ on all edges of the tree satisfies $2(1 - 2p)^2 < 1$. ■
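
To connect this condition with the threshold in the abstract, read there as $p^* = (\sqrt{2} - 1)/2^{3/2}$, one can check directly that $p^*$ sits exactly at the boundary of the inequality:

```latex
1 - 2p^* \;=\; 1 - \frac{\sqrt{2}-1}{\sqrt{2}} \;=\; \frac{1}{\sqrt{2}},
\qquad\text{so}\qquad
2\,(1 - 2p^*)^2 \;=\; 2 \cdot \frac{1}{2} \;=\; 1 .
```

Since $1 - 2p$ is positive and decreasing in $p$ on $[0, 1/2)$, the condition $2(1-2p)^2 < 1$ is equivalent to $p > p^*$ in that range.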



7.3 Beyond the A-Branch Model? 

Extending Theorem 1 to continuous edge lengths appears far from trivial. The issue arises in the final union bound over all applications of Proposition 3.15 (more specifically, the ancestral state reconstruction step) which is valid except with inverse polynomial probability over the generated sequences (for a fixed subtree of $T$). Indeed, note that in general there are super-polynomially many restricted subtrees of the true tree $T$ where all edges (paths in $T$) are shorter than $g^*$. Therefore, using only a simple union bound, we cannot hope to guarantee that ancestral state reconstruction is successful simultaneously on all possible partial reconstructed subtrees.

In the previous subsection, we avoided this problem by showing that under the $\Delta$-BM assumption the algorithm follows a deterministic realization path of polynomial length. Moving beyond this proof would require proving that the ancestral state reconstruction can be performed on the random forests generated by the algorithm, but this is not straightforward as the partially reconstructed forest is generated by the same data that is used to perform the ancestral estimation. We conjecture that the correlation created by this "bootstrap" is mild enough to allow our algorithm to work in general, but we cannot provide a rigorous proof at this point.

We remark that Mossel's earlier result [Mos04] on the balanced case with continuous edge lengths is not affected by the issue above because, there, the reconstruction of the tree occurs one level at a time (there is no collision). Hence, ancestral state reconstruction is performed only on fully reconstructed subtrees, of which there are only polynomially many.

Note finally that, even under the discretization assumption made in this paper, achieving $\log(n)$-reconstruction is nontrivial and does not follow from previous techniques. In particular, it can be shown that all previous rigorous reconstruction algorithms for general trees require polynomial sequence lengths even when all edge lengths are identical and below $g^*$. See [Roc08] for a formal argument of this type.



8 Conclusion 



The proof of Steel's Conjecture [Ste01] provides tight results for the phylogenetic reconstruction problem. However, many theoretical and practical questions remain:

• Can the discretization assumption be removed? We conjecture that the answer is yes.

• Can the results be extended to other mutation models? In subsequent work, Roch [Roc09] showed that our results hold for so-called General Time-Reversible (GTR) mutation models below the Kesten-Stigum bound. Can the results be extended to deal with "rates across sites" (see e.g. [Fel04])?

• What is the optimal $g$-value for the Jukes-Cantor model? Is it identical to the critical value $g_{q=4}$ of the reconstruction problem for the so-called "Potts model" with $q = 4$? We note that it is a long-standing open problem to find $g_{q=4}$. The best bounds known are given in [MP03, MSW04a]. See also [MM06, Sly08].